How-To Tutorials - Data

5 ways to create a connection to the Qlik Engine [Tip]

Amey Varangaonkar
13 Jun 2018
8 min read
With mashups or web apps, the Qlik Engine sits outside of your project and is not loaded or accessible by default. The first step, before doing anything else, is to create a connection to the Qlik Engine. Once connected, you can open a session and perform further actions on an app, such as:

Opening a document/app
Making selections
Retrieving visualizations and apps

To use the Qlik Engine API, you open a WebSocket to the engine. The way you do this differs slightly depending on whether you are working with Qlik Sense Enterprise or Qlik Sense Desktop. In this article, we will elaborate on how you can create a connection to the Qlik Engine and the benefits of doing so. The following excerpt has been taken from the book Mastering Qlik Sense, authored by Martin Mahler and Juan Ignacio Vitantonio.

Creating a connection

To create a connection using WebSockets, you first need to establish a new WebSocket communication line. To open a WebSocket to the engine, use one of the following URIs:

Qlik Sense Enterprise: wss://server.domain.com:4747/app/ or wss://server.domain.com[/virtual proxy]/app/
Qlik Sense Desktop: ws://localhost:4848/app

Creating a connection using WebSockets

In the case of Qlik Sense Desktop, all you need to do is define a WebSocket variable, including its connection string, in the following way:

var ws = new WebSocket("ws://localhost:4848/app/");

Once the connection is open (the ws.onopen event has fired), you can call additional engine methods using ws.send(). The following example retrieves the list of available documents in a Qlik Sense Desktop environment and appends them to an HTML list:

<html>
  <body>
    <ul id='docList'>
    </ul>
  </body>
</html>
<script>
  var ws = new WebSocket("ws://localhost:4848/app/");
  // Request the list of available documents (apps)
  var request = {
    "handle": -1,
    "method": "GetDocList",
    "params": {},
    "outKey": -1,
    "id": 2
  };
  ws.onopen = function(event){
    ws.send(JSON.stringify(request));
    // Receive the response
    ws.onmessage = function (event) {
      var response = JSON.parse(event.data);
      if(response.method != 'OnConnected'){
        var docList = response.result.qDocList;
        var list = '';
        docList.forEach(function(doc){
          list += '<li>'+doc.qDocName+'</li>';
        });
        document.getElementById('docList').innerHTML = list;
      }
    }
  }
</script>

The preceding example will produce a list of your documents in the browser if you have Qlik Sense Desktop running in the background. All Engine methods and calls can be tested in a user-friendly way by exploring the Qlik Engine in the Dev Hub.

A single WebSocket connection can be associated with only one engine session (consisting of the app context plus the user). If you need to work with multiple apps, you must open a separate WebSocket for each one.

If you wish to create a WebSocket connection directly to an app, you can extend the connection URL to include the application name or, in the case of Qlik Sense Enterprise, the app GUID. You can then use the methods from the app class, and any other classes, as you continue to work with objects within the app:

var ws = new WebSocket("ws://localhost:4848/app/MasteringQlikSense.qvf");

Creating a connection to the Qlik Server Engine

Connecting to the engine on a Qlik Sense server environment is a little bit different, as you will need to take care of authentication first.
Authentication is handled in different ways, depending on how you have set up your server configuration, with the most common options being:

Ticketing
Certificates
Header authentication

Authentication also depends on where the code that is interacting with the Qlik Engine is running. It can be performed in several ways, depending on how your installation is configured and where the code is running:

If you are running the code from a trusted computer, you can use certificates, which first need to be exported via the QMC
If the code is running in a web browser, or certificates are not available, then you must authenticate via the virtual proxy of the server

Creating a connection using certificates

Certificates can be considered a seal of trust, which allows you to communicate with the Qlik Engine directly with full permission. As such, only backend solutions should ever have access to certificates, and you should guard carefully how you distribute them. To connect using certificates, you first need to export them via the QMC, which is a relatively easy thing to do. Once they are exported, you need to copy them to the folder where your project is located and load them using the following code:

<html>
<body>
<h1>Mastering QS</h1>
</body>
<script>
  // Read the exported certificates from disk
  var certPath = path.join('C:', 'ProgramData', 'Qlik', 'Sense', 'Repository', 'Exported Certificates', '.Local Certificates');
  var certificates = {
    cert: fs.readFileSync(path.resolve(certPath, 'client.pem')),
    key: fs.readFileSync(path.resolve(certPath, 'client_key.pem')),
    root: fs.readFileSync(path.resolve(certPath, 'root.pem'))
  };

  // Open a WebSocket using the engine port (rather than going through the proxy)
  var ws = new WebSocket('wss://server.domain.com:4747/app/', {
    ca: certificates.root,
    cert: certificates.cert,
    key: certificates.key,
    headers: {
      'X-Qlik-User': 'UserDirectory=internal; UserId=sa_engine'
    }
  });

  ws.onopen = function (event) {
    // Call your methods
  }
</script>

Creating a connection using the Mashup API

Connecting to the engine over raw WebSockets is a fundamental step to start interacting with Qlik, but it is also very low-level. For most use cases, the Mashup API provides a more developer-friendly abstraction layer to help you get up to speed. The Mashup API utilizes the qlik interface as an external interface to Qlik Sense, used for mashups and for including Qlik Sense objects in external web pages.

To load the qlik module, you first need to ensure RequireJS is available in your main project file. You will then have to specify the URL of your Qlik Sense environment, as well as the prefix of the virtual proxy, if there is one:

<html>
<body>
<h1>Mastering QS</h1>
</body>
</html>
<script src="https://cdnjs.cloudflare.com/ajax/libs/require.js/2.3.5/require.min.js"></script>
<script>
  //Prefix is used for when a virtual proxy is used with the browser.
  var prefix = window.location.pathname.substr(0, window.location.pathname.toLowerCase().lastIndexOf("/extensions") + 1);

  //Config for retrieving the qlik.js module from the Qlik Sense Server
  var config = {
    host: window.location.hostname,
    prefix: prefix,
    port: window.location.port,
    isSecure: window.location.protocol === "https:"
  };

  require.config({
    baseUrl: (config.isSecure ? "https://" : "http://") + config.host + (config.port ? ":" + config.port : "") + config.prefix + "resources"
  });

  require(["js/qlik"], function (qlik) {
    qlik.setOnError(function (error) {
      console.log(error);
    });
    //Open an App
    var app = qlik.openApp('MasteringQlikSense.qvf', config);
  });
</script>

Once you have created the connection to an app, you can start leveraging the full API by conveniently creating HyperCubes, connecting to fields, passing selections, retrieving objects, and much more. The Mashup API is intended for browser-based projects where authentication is handled in the same way as if you were going to open Qlik Sense. If you wish to use the Mashup API, or some parts of it, with a backend solution, you need to take care of authentication first.

Creating a connection using enigma.js

Enigma is Qlik's open-source promise wrapper for the engine. You can use enigma directly when you are in the Mashup API, or you can load it as a separate module. When you are writing code from within the Mashup API, you can retrieve the correct schema directly from the list of available modules that are loaded together with qlik.js, via 'autogenerated/qix/engine-api'.

The following example connects to a demo app using enigma.js:

define(function () {
  return function () {
    require(['qlik', 'enigma', 'autogenerated/qix/engine-api'], function (qlik, enigma, schema) {
      //The base config with all details filled in
      var config = {
        schema: schema,
        appId: "My Demo App.qvf",
        session: {
          host: "localhost",
          port: 4848,
          prefix: "",
          unsecure: true,
        },
      }
      //Now that we have a config, use that to connect to the QIX service.
      enigma.getService("qix", config).then(function (qlik) {
        //Open App
        qlik.global.openApp(config.appId).then(function (app) {
          //Create SessionObject for FieldList
          app.createSessionObject({
            qFieldListDef: {
              qShowSystem: false,
              qShowHidden: false,
              qShowSrcTables: true,
              qShowSemantic: true,
              qShowDerivedFields: true
            },
            qInfo: {
              qId: "FieldList",
              qType: "FieldList"
            }
          }).then(function (list) {
            return list.getLayout();
          }).then(function (listLayout) {
            return listLayout.qFieldList.qItems;
          }).then(function (fieldItems) {
            console.log(fieldItems);
          });
        });
      });
    });
  };
});

It is essential to load the correct schema whenever you load enigma.js. The schema is a collection of the API methods that are available in each version of Qlik Sense, which means your schema needs to be in sync with your Qlik Sense version.

As we have seen, it is fairly easy to create a stable connection with the Qlik Engine API. If you liked the above excerpt, make sure you check out the book Mastering Qlik Sense to learn more tips and tricks on working with different kinds of data using Qlik Sense and extract useful business insights.

Related articles:
How Qlik Sense is driving self-service Business Intelligence
Overview of a Qlik Sense® Application's Life Cycle
What we learned from Qlik Qonnections 2018

Classifying with Real-world Examples

Packt
24 Mar 2015
32 min read
This article, by Luis Pedro Coelho and Willi Richert, authors of the book Building Machine Learning Systems with Python - Second Edition, focuses on the topic of classification.

You have probably already used this form of machine learning as a consumer, even if you were not aware of it. Any modern e-mail system will likely be able to automatically detect spam: the system analyzes all incoming e-mails and marks them as either spam or not-spam. Often, you, the end user, can manually tag e-mails as spam or not in order to improve its spam detection ability. This is a form of machine learning where the system takes examples of two types of messages, spam and ham (the typical term for "non-spam e-mails"), and uses these examples to automatically classify incoming e-mails.

The general method of classification is to use a set of examples of each class to learn rules that can be applied to new examples. This is one of the most important machine learning modes and is the topic of this article.

Working with text such as e-mails requires a specific set of techniques and skills. For the moment, we will work with a smaller, easier-to-handle dataset. The example question for this article is, "Can a machine distinguish between flower species based on images?" We will use two datasets where measurements of flower morphology are recorded along with the species for several specimens. We will explore these small datasets using a few simple algorithms. At first, we will write classification code ourselves in order to understand the concepts, but we will quickly switch to using scikit-learn whenever possible. The goal is to first understand the basic principles of classification and then progress to using a state-of-the-art implementation.

The Iris dataset

The Iris dataset is a classic dataset from the 1930s; it is one of the first modern examples of statistical classification. The dataset is a collection of morphological measurements of several Iris flowers. These measurements will enable us to distinguish multiple species of the flowers. Today, species are identified by their DNA fingerprints, but in the 1930s, DNA's role in genetics had not yet been discovered. The following four attributes of each plant were measured:

sepal length
sepal width
petal length
petal width

In general, we will call the individual numeric measurements we use to describe our data features. These features can be directly measured or computed from intermediate data. This dataset has four features. Additionally, for each plant, the species was recorded. The problem we want to solve is, "Given these examples, if we see a new flower out in the field, could we make a good prediction about its species from its measurements?"

This is the supervised learning or classification problem: given labeled examples, can we design a rule to be later applied to other examples? A more familiar example to modern readers who are not botanists is spam filtering, where the user can mark e-mails as spam, and systems use these as well as the non-spam e-mails to determine whether a new, incoming message is spam or not. For the moment, the Iris dataset serves our purposes well. It is small (150 examples, four features each) and can be easily visualized and manipulated.

Visualization is a good first step

Datasets will grow to thousands of features.
With only four in our starting example, we can easily plot all two-dimensional projections on a single page. We will build intuitions on this small example, which can then be extended to large datasets with many more features. Visualizations are excellent at the initial exploratory phase of the analysis as they allow you to learn the general features of your problem as well as catch problems that occurred with data collection early.

Each subplot in the following plot shows all points projected into two of the dimensions. The outlying group (triangles) are the Iris Setosa plants, while Iris Versicolor plants are in the center (circles) and Iris Virginica are plotted with x marks. We can see that there are two large groups: one is of Iris Setosa and another is a mixture of Iris Versicolor and Iris Virginica.

In the following code snippet, we present the code to load the data and generate the plot:

>>> from matplotlib import pyplot as plt
>>> import numpy as np

>>> # We load the data with load_iris from sklearn
>>> from sklearn.datasets import load_iris
>>> data = load_iris()

>>> # load_iris returns an object with several fields
>>> features = data.data
>>> feature_names = data.feature_names
>>> target = data.target
>>> target_names = data.target_names

>>> for t in range(3):
...   if t == 0:
...       c = 'r'
...       marker = '>'
...   elif t == 1:
...       c = 'g'
...       marker = 'o'
...   elif t == 2:
...       c = 'b'
...       marker = 'x'
...   plt.scatter(features[target == t,0],
...               features[target == t,1],
...               marker=marker,
...               c=c)

Building our first classification model

If the goal is to separate the three types of flowers, we can immediately make a few suggestions just by looking at the data. For example, petal length seems to be able to separate Iris Setosa from the other two flower species on its own. We can write a little bit of code to discover where the cut-off is:

>>> # We use NumPy fancy indexing to get an array of strings:
>>> labels = target_names[target]

>>> # The petal length is the feature at position 2
>>> plength = features[:, 2]

>>> # Build an array of booleans:
>>> is_setosa = (labels == 'setosa')

>>> # This is the important step:
>>> max_setosa = plength[is_setosa].max()
>>> min_non_setosa = plength[~is_setosa].min()
>>> print('Maximum of setosa: {0}.'.format(max_setosa))
Maximum of setosa: 1.9.

>>> print('Minimum of others: {0}.'.format(min_non_setosa))
Minimum of others: 3.0.

Therefore, we can build a simple model: if the petal length is smaller than 2, then this is an Iris Setosa flower; otherwise it is either Iris Virginica or Iris Versicolor. This is our first model, and it works very well in that it separates Iris Setosa flowers from the other two species without making any mistakes. In this case, we did not actually do any machine learning. Instead, we looked at the data ourselves, looking for a separation between the classes. Machine learning happens when we write code to look for this separation automatically.

The problem of recognizing Iris Setosa apart from the other two species was very easy. However, we cannot immediately see what the best threshold is for distinguishing Iris Virginica from Iris Versicolor. We can even see that we will never achieve perfect separation with these features. We could, however, look for the best possible separation, the separation that makes the fewest mistakes. For this, we will perform a little computation.
We first select only the non-Setosa features and labels:

>>> # ~ is the boolean negation operator
>>> features = features[~is_setosa]
>>> labels = labels[~is_setosa]
>>> # Build a new target variable, is_virginica
>>> is_virginica = (labels == 'virginica')

Here we are heavily using NumPy operations on arrays. The is_setosa array is a Boolean array and we use it to select a subset of the other two arrays, features and labels. Finally, we build a new Boolean array, is_virginica, by using an equality comparison on labels.

Now, we run a loop over all possible features and thresholds to see which one results in better accuracy. Accuracy is simply the fraction of examples that the model classifies correctly:

>>> # Initialize best_acc to impossibly low value
>>> best_acc = -1.0
>>> for fi in range(features.shape[1]):
...   # We are going to test all possible thresholds
...   thresh = features[:,fi]
...   for t in thresh:
...     # Get the vector for feature `fi`
...     feature_i = features[:, fi]
...     # apply threshold `t`
...     pred = (feature_i > t)
...     acc = (pred == is_virginica).mean()
...     rev_acc = (pred == ~is_virginica).mean()
...     if rev_acc > acc:
...         reverse = True
...         acc = rev_acc
...     else:
...         reverse = False
...
...     if acc > best_acc:
...       best_acc = acc
...       best_fi = fi
...       best_t = t
...       best_reverse = reverse

We need to test two types of thresholds for each feature and value: we test a greater than threshold and the reverse comparison. This is why we need the rev_acc variable in the preceding code; it holds the accuracy of reversing the comparison.

The last few lines select the best model. First, we compare the predictions, pred, with the actual labels, is_virginica. The little trick of computing the mean of the comparisons gives us the fraction of correct results, the accuracy. At the end of the for loop, all the possible thresholds for all the possible features have been tested, and the variables best_fi, best_t, and best_reverse hold our model. This is all the information we need to be able to classify a new, unknown object, that is, to assign a class to it. The following code implements exactly this method:

def is_virginica_test(fi, t, reverse, example):
    "Apply threshold model to a new example"
    test = example[fi] > t
    if reverse:
        test = not test
    return test

What does this model look like? If we run the code on the whole data, the model that is identified as the best makes decisions by splitting on the petal width. One way to gain intuition about how this works is to visualize the decision boundary. That is, we can see which feature values will result in one decision versus the other and exactly where the boundary is. In the following plot, we see two regions: one is white and the other is shaded in grey. Any datapoint that falls in the white region will be classified as Iris Virginica, while any point that falls on the shaded side will be classified as Iris Versicolor.

In a threshold model, the decision boundary will always be a line that is parallel to one of the axes. The plot shows the decision boundary and the two regions where points are classified as either white or grey. It also shows (as a dashed line) an alternative threshold, which will achieve exactly the same accuracy. Our method chose the first threshold it saw, but that was an arbitrary choice.
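A minimal sketch of how such a boundary plot could be drawn (this is illustrative only, not the book's plotting code, and it assumes the threshold found above is on petal width, that is best_fi == 3):

from matplotlib import pyplot as plt

# Illustrative sketch: plot the non-Setosa examples on petal length vs. petal
# width and draw the learned petal-width threshold `best_t` as a horizontal
# decision boundary parallel to the x axis.
plt.scatter(features[is_virginica, 2], features[is_virginica, 3],
            c='white', edgecolors='k', label='Iris Virginica')
plt.scatter(features[~is_virginica, 2], features[~is_virginica, 3],
            c='grey', edgecolors='k', label='Iris Versicolor')
plt.axhline(best_t, color='k')        # the decision boundary
plt.xlabel('petal length (cm)')
plt.ylabel('petal width (cm)')
plt.legend()
plt.show()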
Evaluation – holding out data and cross-validation

The model discussed in the previous section is a simple model; it achieves 94 percent accuracy on the whole data. However, this evaluation may be overly optimistic. We used the data to define what the threshold will be, and then we used the same data to evaluate the model. Of course, the model will perform better than anything else we tried on this dataset. The reasoning is circular.

What we really want to do is estimate the ability of the model to generalize to new instances. We should measure its performance on instances that the algorithm has not seen during training. Therefore, we are going to do a more rigorous evaluation and use held-out data. For this, we are going to break up the data into two groups: on one group, we'll train the model, and on the other, the one we held out of training, we'll test it. The full code, which is an adaptation of the code presented earlier, is available on the online support repository. Its output is as follows:

Training accuracy was 96.0%.
Testing accuracy was 90.0% (N = 50).

The result on the training data (which is a subset of the whole data) is apparently even better than before. However, what is important to note is that the result on the testing data is lower than that on the training data. While this may surprise an inexperienced machine learner, it is expected that testing accuracy will be lower than the training accuracy. To see why, look back at the plot that showed the decision boundary. Consider what would have happened if some of the examples close to the boundary were not there, or if one of the examples between the two lines was missing. It is easy to imagine that the boundary would then move a little bit to the right or to the left so as to put them on the wrong side of the border.

The accuracy on the training data, the training accuracy, is almost always an overly optimistic estimate of how well your algorithm is doing. We should always measure and report the testing accuracy, which is the accuracy on a collection of examples that were not used for training. These concepts will become more and more important as the models become more complex. In this example, the difference between the accuracy measured on training data and on testing data is not very large. When using a complex model, it is possible to get 100 percent accuracy in training and do no better than random guessing on testing!

One possible problem with what we did previously, which was to hold out data from training, is that we only used half the data for training. Perhaps it would have been better to use more training data. On the other hand, if we then leave too little data for testing, the error estimation is performed on a very small number of examples. Ideally, we would like to use all of the data for training and all of the data for testing as well, which is impossible. We can achieve a good approximation of this impossible ideal by a method called cross-validation. One simple form of cross-validation is leave-one-out cross-validation. We take an example out of the training data, learn a model without this example, and then test whether the model classifies this example correctly. This process is then repeated for all the elements in the dataset.
The following code implements exactly this type of cross-validation:

>>> correct = 0.0
>>> for ei in range(len(features)):
...     # select all but the one at position `ei`:
...     training = np.ones(len(features), bool)
...     training[ei] = False
...     testing = ~training
...     model = fit_model(features[training], is_virginica[training])
...     predictions = predict(model, features[testing])
...     correct += np.sum(predictions == is_virginica[testing])
>>> acc = correct/float(len(features))
>>> print('Accuracy: {0:.1%}'.format(acc))
Accuracy: 87.0%

At the end of this loop, we will have tested a series of models on all the examples and have obtained a final average result. When using cross-validation, there is no circularity problem because each example was tested on a model which was built without taking that datapoint into account. Therefore, the cross-validated estimate is a reliable estimate of how well the models would generalize to new data.

The major problem with leave-one-out cross-validation is that we are now forced to perform many times more work. In fact, you must learn a whole new model for each and every example, and this cost will increase as our dataset grows. We can get most of the benefits of leave-one-out at a fraction of the cost by using x-fold cross-validation, where x stands for a small number. For example, to perform five-fold cross-validation, we break up the data into five groups, the so-called five folds. Then you learn five models: each time you will leave one fold out of the training data. The resulting code will be similar to the code given earlier in this section, but we leave 20 percent of the data out instead of just one element. We test each of these models on the left-out fold and average the results.

The preceding figure illustrates this process for five blocks: the dataset is split into five pieces. For each fold, you hold out one of the blocks for testing and train on the other four. You can use any number of folds you wish. There is a trade-off between computational efficiency (the more folds, the more computation is necessary) and accurate results (the more folds, the closer you are to using the whole of the data for training). Five folds is often a good compromise. This corresponds to training with 80 percent of your data, which should already be close to what you will get from using all the data. If you have little data, you can even consider using 10 or 20 folds. In the extreme case, if you have as many folds as datapoints, you are simply performing leave-one-out cross-validation. On the other hand, if computation time is an issue and you have more data, 2 or 3 folds may be the more appropriate choice.

When generating the folds, you need to be careful to keep them balanced. For example, if all of the examples in one fold come from the same class, then the results will not be representative. We will not go into the details of how to do this, because the machine learning package scikit-learn will handle them for you.

We have now generated several models instead of just one. So, "What final model do we return and use for new data?" The simplest solution is now to train a single overall model on all your training data. The cross-validation loop gives you an estimate of how well this model should generalize. A cross-validation schedule allows you to use all your data to estimate whether your methods are doing well. At the end of the cross-validation loop, you can then use all your data to train a final model.
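The leave-one-out loop above relies on fit_model and predict helpers that this excerpt does not define. A minimal sketch of what they might look like, assuming they simply wrap the exhaustive threshold search shown earlier (the book's support repository contains the authors' actual version):

def fit_model(features, labels):
    # Exhaustive threshold search, as in the earlier listing, on the training subset only.
    best_acc = -1.0
    best_fi = best_t = best_reverse = None
    for fi in range(features.shape[1]):
        for t in features[:, fi]:
            pred = (features[:, fi] > t)
            acc = (pred == labels).mean()
            rev_acc = (pred == ~labels).mean()
            reverse = rev_acc > acc
            acc = max(acc, rev_acc)
            if acc > best_acc:
                best_acc, best_fi, best_t, best_reverse = acc, fi, t, reverse
    return best_fi, best_t, best_reverse

def predict(model, features):
    # Apply the learned (feature, threshold, reverse) triple to one or more examples.
    fi, t, reverse = model
    pred = features[:, fi] > t
    return ~pred if reverse else pred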
Although it was not properly recognized when machine learning was starting out as a field, nowadays it is seen as a very bad sign to even discuss the training accuracy of a classification system. This is because the results can be very misleading, and even just presenting them marks you as a newbie in machine learning. We always want to measure and compare either the error on a held-out dataset or the error estimated using a cross-validation scheme.

Building more complex classifiers

In the previous section, we used a very simple model: a threshold on a single feature. Are there other types of systems? Yes, of course! Many others. To think of the problem at a higher abstraction level, "What makes up a classification model?" We can break it up into three parts:

The structure of the model: How exactly will a model make decisions? In this case, the decision depended solely on whether a given feature was above or below a certain threshold value. This is too simplistic for all but the simplest problems.
The search procedure: How do we find the model we need to use? In our case, we tried every possible combination of feature and threshold. You can easily imagine that as models get more complex and datasets get larger, it rapidly becomes impossible to attempt all combinations and we are forced to use approximate solutions. In other cases, we need to use advanced optimization methods to find a good solution (fortunately, scikit-learn already implements these for you, so using them is easy even if the code behind them is very advanced).
The gain or loss function: How do we decide which of the possibilities tested should be returned? Rarely do we find the perfect solution, the model that never makes any mistakes, so we need to decide which one to use. We used accuracy, but sometimes it will be better to optimize so that the model makes fewer errors of a specific kind. For example, in spam filtering, it may be worse to delete a good e-mail than to erroneously let a bad e-mail through. In that case, we may want to choose a model that is conservative in throwing out e-mails rather than the one that just makes the fewest mistakes overall. We can discuss these issues in terms of gain (which we want to maximize) or loss (which we want to minimize). They are equivalent, but sometimes one is more convenient than the other.

We can play around with these three aspects of classifiers and get different systems. A simple threshold is one of the simplest models available in machine learning libraries and only works well when the problem is very simple, such as with the Iris dataset. In the next section, we will tackle a more difficult classification task that requires a more complex structure.

In our case, we optimized the threshold to minimize the number of errors. Alternatively, we might have different loss functions. It might be that one type of error is much costlier than the other. In a medical setting, false negatives and false positives are not equivalent. A false negative (when the result of a test comes back negative, but is in fact wrong) might lead to the patient not receiving treatment for a serious disease. A false positive (when the test comes back positive even though the patient does not actually have that disease) might lead to additional tests to confirm the diagnosis or to unnecessary treatment (which can still have costs, including side effects from the treatment, but is often less serious than missing a diagnosis). Therefore, depending on the exact setting, different trade-offs can make sense.
At one extreme, if the disease is fatal and the treatment is cheap with very few negative side effects, then you want to minimize false negatives as much as you can. What the gain/cost function should be is always dependent on the exact problem you are working on. When we present a general-purpose algorithm, we often focus on minimizing the number of mistakes, achieving the highest accuracy. However, if some mistakes are costlier than others, it might be better to accept a lower overall accuracy to minimize the overall costs.

A more complex dataset and a more complex classifier

We will now look at a slightly more complex dataset. This will motivate the introduction of a new classification algorithm and a few other ideas.

Learning about the Seeds dataset

We now look at another agricultural dataset, which is still small, but already too large to plot exhaustively on a page as we did with Iris. This dataset consists of measurements of wheat seeds. There are seven features present, which are as follows:

area A
perimeter P
compactness C = 4πA/P²
length of kernel
width of kernel
asymmetry coefficient
length of kernel groove

There are three classes, corresponding to three wheat varieties: Canadian, Kama, and Rosa. As earlier, the goal is to be able to classify the species based on these morphological measurements. Unlike the Iris dataset, which was collected in the 1930s, this is a very recent dataset and its features were automatically computed from digital images. This is how image pattern recognition can be implemented: you can take images in digital form, compute a few relevant features from them, and use a generic classification system. For the moment, we will work with the features that are given to us.

UCI Machine Learning Dataset Repository

The University of California at Irvine (UCI) maintains an online repository of machine learning datasets (at the time of writing, they list 233 datasets). Both the Iris and the Seeds datasets used in this article were taken from there. The repository is available online at http://archive.ics.uci.edu/ml/.

Features and feature engineering

One interesting aspect of these features is that the compactness feature is not actually a new measurement, but a function of the previous two features, area and perimeter. It is often very useful to derive new combined features. Trying to create new features is generally called feature engineering. It is sometimes seen as less glamorous than algorithms, but it often matters more for performance (a simple algorithm on well-chosen features will perform better than a fancy algorithm on not-so-good features).

In this case, the original researchers computed the compactness, which is a typical feature for shapes. It is also sometimes called roundness. This feature will have the same value for two kernels, one of which is twice as big as the other one, but with the same shape. However, it will have different values for kernels that are very round (when the feature is close to one) when compared to kernels that are elongated (when the feature is closer to zero).

The goals of a good feature are to simultaneously vary with what matters (the desired output) and be invariant with what does not. For example, compactness does not vary with size, but varies with the shape. In practice, it might be hard to achieve both objectives perfectly, but we want to approximate this ideal. You will need to use background knowledge to design good features.
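As a concrete illustration of such a derived feature, compactness can be computed directly from the area and perimeter columns; the numbers below are made-up example values, not rows from the Seeds dataset:

import numpy as np

# Derive the compactness feature C = 4*pi*A/P**2 from area and perimeter.
# The values here are illustrative placeholders only.
area = np.array([15.3, 14.9, 12.7])
perimeter = np.array([14.8, 14.6, 13.4])
compactness = 4 * np.pi * area / perimeter ** 2
print(compactness)  # close to 1 for round kernels, smaller for elongated ones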
Fortunately, for many problem domains, there is already a vast literature of possible features and feature types that you can build upon. For images, all of the previously mentioned features are typical, and computer vision libraries will compute them for you. In text-based problems, too, there are standard solutions that you can mix and match. When possible, you should use your knowledge of the problem to design a specific feature or to select which ones from the literature are more applicable to the data at hand. Even before you have data, you must decide which data is worthwhile to collect. Then, you hand all your features to the machine to evaluate and compute the best classifier.

A natural question is whether we can select good features automatically. This problem is known as feature selection. There are many methods that have been proposed for this problem, but in practice very simple ideas work best. For the small problems we are currently exploring, it does not make sense to use feature selection, but if you had thousands of features, then throwing out most of them might make the rest of the process much faster.

Nearest neighbor classification

For use with this dataset, we will introduce a new classifier: the nearest neighbor classifier. The nearest neighbor classifier is very simple. When classifying a new element, it looks in the training data for the object that is closest to it, its nearest neighbor. Then, it returns that object's label as the answer. Notice that this model performs perfectly on its training data! For each point, its closest neighbor is itself, and so its label matches perfectly (unless two examples with different labels have exactly the same feature values, which will indicate that the features you are using are not very descriptive). Therefore, it is essential to test the classification using a cross-validation protocol.

The nearest neighbor method can be generalized to look not at a single neighbor, but at multiple ones, and take a vote amongst the neighbors. This makes the method more robust to outliers or mislabeled data.

Classifying with scikit-learn

We have been using handwritten classification code, but Python is a very appropriate language for machine learning because of its excellent libraries. In particular, scikit-learn has become the standard library for many machine learning tasks, including classification. We are going to use its implementation of nearest neighbor classification in this section. The scikit-learn classification API is organized around classifier objects. These objects have the following two essential methods:

fit(features, labels): This is the learning step and fits the parameters of the model
predict(features): This method can only be called after fit and returns a prediction for one or more inputs

Here is how we could use its implementation of k-nearest neighbors for our data. We start by importing the KNeighborsClassifier object from the sklearn.neighbors submodule:

>>> from sklearn.neighbors import KNeighborsClassifier

The scikit-learn module is imported as sklearn (sometimes you will also find that scikit-learn is referred to using this short name instead of the full name). All of the sklearn functionality is in submodules, such as sklearn.neighbors. We can now instantiate a classifier object. In the constructor, we specify the number of neighbors to consider, as follows:

>>> classifier = KNeighborsClassifier(n_neighbors=1)

If we do not specify the number of neighbors, it defaults to 5, which is often a good choice for classification.
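For intuition, here is a minimal from-scratch sketch (not the book's code) of what the single-neighbor case computes under the hood: the label of the training example closest to the query under Euclidean distance.

import numpy as np

def nearest_neighbor_predict(train_features, train_labels, example):
    # Distance from the new example to every training example
    distances = np.sqrt(((train_features - example) ** 2).sum(axis=1))
    # Return the label of the closest training example
    return train_labels[distances.argmin()]

# e.g. predict the last example from all the others:
# nearest_neighbor_predict(features[:-1], labels[:-1], features[-1])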
We will want to use cross-validation (of course) to look at our data. The scikit-learn module also makes this easy:

>>> from sklearn.cross_validation import KFold

>>> kf = KFold(len(features), n_folds=5, shuffle=True)
>>> # `means` will be a list of mean accuracies (one entry per fold)
>>> means = []
>>> for training,testing in kf:
...   # We fit a model for this fold, then apply it to the
...   # testing data with `predict`:
...   classifier.fit(features[training], labels[training])
...   prediction = classifier.predict(features[testing])
...
...   # np.mean on an array of booleans returns fraction
...   # of correct decisions for this fold:
...   curmean = np.mean(prediction == labels[testing])
...   means.append(curmean)
>>> print("Mean accuracy: {:.1%}".format(np.mean(means)))
Mean accuracy: 90.5%

Using five folds for cross-validation, for this dataset, with this algorithm, we obtain 90.5 percent accuracy. As we discussed in the earlier section, the cross-validation accuracy is lower than the training accuracy, but this is a more credible estimate of the performance of the model.

Looking at the decision boundaries

We will now examine the decision boundary. In order to plot these on paper, we will simplify and look at only two dimensions. Take a look at the following plot: Canadian examples are shown as diamonds, Kama seeds as circles, and Rosa seeds as triangles. Their respective areas are shown as white, black, and grey. You might be wondering why the regions are so horizontal, almost weirdly so. The problem is that the x axis (area) ranges from 10 to 22, while the y axis (compactness) ranges from 0.75 to 1.0. This means that a small change in x is actually much larger than a small change in y. So, when we compute the distance between points, we are, for the most part, only taking the x axis into account. This is also a good example of why it is a good idea to visualize our data and look for red flags or surprises.

If you studied physics (and you remember your lessons), you might have already noticed that we had been summing up lengths, areas, and dimensionless quantities, mixing up our units (which is something you never want to do in a physical system). We need to normalize all of the features to a common scale. There are many solutions to this problem; a simple one is to normalize to z-scores. The z-score of a value is how far away from the mean it is, in units of standard deviation. It comes down to this operation:

f' = (f - μ) / σ

In this formula, f is the old feature value, f' is the normalized feature value, μ is the mean of the feature, and σ is the standard deviation. Both μ and σ are estimated from training data. Independent of what the original values were, after z-scoring, a value of zero corresponds to the training mean, positive values are above the mean, and negative values are below it.

The scikit-learn module makes it very easy to use this normalization as a preprocessing step. We are going to use a pipeline of transformations: the first element will do the transformation and the second element will do the classification. We start by importing both the pipeline and the feature scaling classes as follows:

>>> from sklearn.pipeline import Pipeline
>>> from sklearn.preprocessing import StandardScaler

Now, we can combine them:

>>> classifier = KNeighborsClassifier(n_neighbors=1)
>>> classifier = Pipeline([('norm', StandardScaler()), ('knn', classifier)])

The Pipeline constructor takes a list of pairs (str, clf).
Each pair corresponds to a step in the pipeline: the first element is a string naming the step, while the second element is the object that performs the transformation. Advanced usage of the object uses these names to refer to different steps.

After normalization, every feature is in the same units (technically, every feature is now dimensionless; it has no units) and we can more confidently mix dimensions. In fact, if we now run our nearest neighbor classifier, we obtain 93 percent accuracy, estimated with the same five-fold cross-validation code shown previously! Look at the decision space again in two dimensions: the boundaries are now different and you can see that both dimensions make a difference for the outcome. In the full dataset, everything is happening in a seven-dimensional space, which is very hard to visualize, but the same principle applies; while a few dimensions are dominant in the original data, after normalization, they are all given the same importance.

Binary and multiclass classification

The first classifier we used, the threshold classifier, was a simple binary classifier. Its result is either one class or the other, as a point is either above the threshold value or it is not. The second classifier we used, the nearest neighbor classifier, was a natural multiclass classifier: its output can be one of several classes.

It is often simpler to define a simple binary method than one that works on multiclass problems. However, we can reduce any multiclass problem to a series of binary decisions. This is what we did earlier in the Iris dataset, in a haphazard way: we observed that it was easy to separate one of the initial classes and focused on the other two, reducing the problem to two binary decisions:

Is it an Iris Setosa (yes or no)?
If not, check whether it is an Iris Virginica (yes or no).

Of course, we want to leave this sort of reasoning to the computer. As usual, there are several solutions to this multiclass reduction. The simplest is to use a series of one versus the rest classifiers. For each possible label ℓ, we build a classifier of the type "is this ℓ or something else?" When applying the rule, exactly one of the classifiers will say yes and we will have our solution. Unfortunately, this does not always happen, so we have to decide how to deal with either multiple positive answers or no positive answers.

Alternatively, we can build a classification tree. Split the possible labels into two, and build a classifier that asks, "Should this example go in the left or the right bin?" We can perform this splitting recursively until we obtain a single label. The preceding diagram depicts the tree of reasoning for the Iris dataset. Each diamond is a single binary classifier. It is easy to imagine that we could make this tree larger and encompass more decisions. This means that any classifier that can be used for binary classification can also be adapted to handle any number of classes in a simple way.

There are many other possible ways of turning a binary method into a multiclass one. There is no single method that is clearly better in all cases. The scikit-learn module implements several of these methods in the sklearn.multiclass submodule; a short usage sketch follows below. Some classifiers are binary systems, while many real-life problems are naturally multiclass. Several simple protocols reduce a multiclass problem to a series of binary decisions and allow us to apply the binary models to our multiclass problem.
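As a hedged illustration (not taken from the book), a one-versus-the-rest reduction can be applied with a couple of lines of scikit-learn, wrapping a binary learner such as logistic regression:

from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

# One classifier is trained per class ("is it this class or something else?")
# and the wrapper picks the most confident positive answer.
ovr = OneVsRestClassifier(LogisticRegression())
ovr.fit(features, labels)           # `features` and `labels` as in the examples above
print(ovr.predict(features[:5]))    # predictions are full class labels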
This means methods that are apparently only for binary data can be applied to multiclass data with little extra effort.

Summary

Classification means generalizing from examples to build a model (that is, a rule that can automatically be applied to new, unclassified objects). It is one of the fundamental tools in machine learning. In a sense, this was a very theoretical article, as we introduced generic concepts with simple examples. We went over a few operations with the Iris dataset. This is a small dataset. However, it has the advantage that we were able to plot it out and see what we were doing in detail. This is something that will be lost when we move on to problems with many dimensions and many thousands of examples. The intuitions we gained here will all still be valid.

You also learned that the training error is a misleading, over-optimistic estimate of how well the model does. We must, instead, evaluate it on testing data that has not been used for training. In order to not waste too many examples in testing, a cross-validation schedule can get us the best of both worlds (at the cost of more computation).

We also had a look at the problem of feature engineering. Features are not predefined for you; choosing and designing features is an integral part of designing a machine learning pipeline. In fact, it is often the area where you can get the most improvements in accuracy, as better data beats fancier methods.

Further resources on this subject:
Ridge Regression [article]
The Spark programming model [article]
Using cross-validation [article]

Working with Kafka Streams

Amarabha Banerjee
22 Feb 2018
6 min read
This article is a book excerpt from Apache Kafka 1.0 Cookbook, written by Raúl Estrada. This book will simplify real-time data processing by leveraging Apache Kafka 1.0. In today's tutorial, we are going to discuss how to work with Apache Kafka Streams efficiently.

In the data world, a stream is one of the most important abstractions. A stream depicts a continuously updating and unbounded process; here, unbounded means unlimited size. By definition, a stream is a fault-tolerant, replayable, and ordered sequence of immutable data records, where a data record is defined as a key-value pair. Before we proceed, some concepts need to be defined:

Stream processing application: Any program that utilizes the Kafka Streams library is known as a stream processing application.
Processor topology: This is a topology that defines the computational logic of the data processing that a stream processing application requires to be performed. A topology is a graph of stream processors (nodes) connected by streams (edges). There are two ways to define a topology: via the low-level Processor API, or via the Kafka Streams DSL.
Stream processor: This is a node present in the processor topology. It represents a processing step in a topology and is used to transform data in streams. The standard operations (filter, join, map, and aggregations) are examples of stream processors available in Kafka Streams.
Windowing: Sometimes, data records are divided into time buckets by a stream processor to window the stream by time. This is usually required for aggregation and join operations.
Join: When two or more streams are merged based on the keys of their data records, a new stream is generated. The operation that generates this new stream is called a join. A join over record streams is usually required to be performed on a windowing basis.
Aggregation: A new stream is generated by combining multiple input records from one input stream into a single output record. The operation that creates this new stream is known as aggregation. Examples of aggregations are sums and counts.

Setting up the project

This recipe sets up the project to use Kafka Streams in the Treu application project.

Getting ready

The project generated in the first four chapters is needed.

How to do it

Open the build.gradle file in the Treu project generated in Chapter 4, Message Enrichment, and add these lines:

apply plugin: 'java'
apply plugin: 'application'

sourceCompatibility = '1.8'
mainClassName = 'treu.StreamingApp'

repositories {
  mavenCentral()
}

version = '0.1.0'

dependencies {
  compile 'org.apache.kafka:kafka-clients:1.0.0'
  compile 'org.apache.kafka:kafka-streams:1.0.0'
  compile 'org.apache.avro:avro:1.7.7'
}

jar {
  manifest {
    attributes 'Main-Class': mainClassName
  }
  from {
    configurations.compile.collect {
      it.isDirectory() ? it : zipTree(it)
    }
  } {
    exclude "META-INF/*.SF"
    exclude "META-INF/*.DSA"
    exclude "META-INF/*.RSA"
  }
}

To rebuild the app, from the project root directory, run this command:

$ gradle jar

The output is something like:

...
BUILD SUCCESSFUL
Total time: 24.234 secs

As the next step, create a file called StreamingApp.java in the src/main/java/treu directory with the following contents:

package treu;

import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.Topology;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KStream;

import java.util.Properties;

public class StreamingApp {
  public static void main(String[] args) throws Exception {
    Properties props = new Properties();
    props.put(StreamsConfig.APPLICATION_ID_CONFIG, "streaming_app_id"); // 1
    props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // 2

    StreamsConfig config = new StreamsConfig(props); // 3
    StreamsBuilder builder = new StreamsBuilder(); // 4

    KStream<String, String> simpleFirstStream = builder.stream("src-topic"); // 5
    KStream<String, String> upperCasedStream = simpleFirstStream.mapValues(String::toUpperCase); // 6
    upperCasedStream.to("out-topic"); // 7

    // Build the topology after all processing steps have been declared
    Topology topology = builder.build();
    KafkaStreams streams = new KafkaStreams(topology, config);

    System.out.println("Streaming App Started");
    streams.start();
    Thread.sleep(30000); // 8
    System.out.println("Shutting down the Streaming App");
    streams.close();
  }
}

How it works

Follow the comments in the code:

In line //1, the APPLICATION_ID_CONFIG is an identifier for the app inside the broker
In line //2, the BOOTSTRAP_SERVERS_CONFIG specifies the broker to use
In line //3, the StreamsConfig object is created; it is built with the properties specified
In line //4, the StreamsBuilder object is created; it is used to build a topology
In line //5, when the KStream is created, the input topic is specified
In line //6, another KStream is created with the contents of src-topic, but in uppercase
In line //7, the uppercased stream writes its output to out-topic
In line //8, the application will run for 30 seconds

Running the streaming application

In the previous recipe, the first version of the streaming app was coded. Now, in this recipe, everything is compiled and executed.

Getting ready

The execution of the previous recipe of this chapter is needed.

How to do it

The streaming app doesn't receive arguments from the command line.

To build the project, from the treu directory, run the following command:

$ gradle jar

If everything is OK, the output should be:

...
BUILD SUCCESSFUL
Total time: …

To run the project, we use four different command-line windows. In the first command-line Terminal, run the control center:

$ <confluent-path>/bin/confluent start

In the second command-line Terminal, create the two topics needed:

$ bin/kafka-topics --create --topic src-topic --zookeeper localhost:2181 --partitions 1 --replication-factor 1
$ bin/kafka-topics --create --topic out-topic --zookeeper localhost:2181 --partitions 1 --replication-factor 1

In that same command-line Terminal, start the producer:

$ bin/kafka-console-producer --broker-list localhost:9092 --topic src-topic

This window is where the input messages are typed.

In the third command-line Terminal, start a consumer script listening to out-topic:

$ bin/kafka-console-consumer --bootstrap-server localhost:9092 --from-beginning --topic out-topic

In the fourth command-line Terminal, start up the processing application.
Go to the project root directory (where the Gradle jar command was executed) and run:

$ java -jar ./build/libs/treu-0.1.0.jar localhost:9092

Go to the second command-line Terminal (console-producer) and send the following three messages (remember to press Enter between messages and execute each one on just one line):

$> Hello [Enter]
$> Kafka [Enter]
$> Streams [Enter]

The messages typed in the console-producer should appear uppercase in the out-topic console consumer window:

> HELLO
> KAFKA
> STREAMS

We discussed Apache Kafka Streams and how to get up and running with it. If you liked this post, be sure to check out Apache Kafka 1.0 Cookbook, which contains more useful recipes for working with an Apache Kafka installation.

How to implement Reinforcement Learning with TensorFlow

Gebin George
05 Mar 2018
3 min read
This article is an excerpt from the book Deep Learning Essentials, co-authored by Wei Di, Anurag Bhardwaj, and Jianing Wei. This book will help you get to grips with the essentials of deep learning by leveraging the power of Python.

In today's tutorial, we will implement reinforcement learning with a TensorFlow-based Q-learning algorithm. We will look at a popular game, FrozenLake, which has an inbuilt environment in the OpenAI gym package. The idea behind the FrozenLake game is quite simple. It consists of 4 x 4 grid blocks, where each block can have one of the following four states:

S: Starting point/Safe state
F: Frozen surface/Safe state
H: Hole/Unsafe state
G: Goal/Safe or Terminal state

In each of the 16 cells, you can use one of the four actions, namely up/down/left/right, to move to a neighboring state. The goal of the game is to start from state S and end at state G. We will show how we can use a neural network-based Q-learning system to learn a safe path from state S to state G.

First, we import the necessary packages and define the game environment:

import gym
import numpy as np
import random
import tensorflow as tf

env = gym.make('FrozenLake-v0')

Once the environment is defined, we can define the network structure that learns the Q-values. We will use a one-layer neural network with 16 input neurons and 4 output neurons, as follows:

input_matrix = tf.placeholder(shape=[1,16],dtype=tf.float32)
weight_matrix = tf.Variable(tf.random_uniform([16,4],0,0.01))
Q_matrix = tf.matmul(input_matrix,weight_matrix)
prediction_matrix = tf.argmax(Q_matrix,1)
nextQ = tf.placeholder(shape=[1,4],dtype=tf.float32)
loss = tf.reduce_sum(tf.square(nextQ - Q_matrix))
train = tf.train.GradientDescentOptimizer(learning_rate=0.05)
model = train.minimize(loss)
init_op = tf.global_variables_initializer()

Now we can choose the action in an epsilon-greedy fashion (the snippet below sits inside the book's training loop, where sess, num_states, current_state, sample_epsilon, and the discount factor y are already defined):

ip_q = np.zeros(num_states)
ip_q[current_state] = 1
a,allQ = sess.run([prediction_matrix,Q_matrix],feed_dict={input_matrix: [ip_q]})
if np.random.rand(1) < sample_epsilon:
    a[0] = env.action_space.sample()
next_state, reward, done, info = env.step(a[0])
ip_q1 = np.zeros(num_states)
ip_q1[next_state] = 1
Q1 = sess.run(Q_matrix,feed_dict={input_matrix:[ip_q1]})
maxQ1 = np.max(Q1)
targetQ = allQ
targetQ[0,a[0]] = reward + y*maxQ1
_,W1 = sess.run([model,weight_matrix],feed_dict={input_matrix: [ip_q],nextQ:targetQ})

The figure RL with Q-learning example shows the sample output of the program when executed. You can see different values of the Q matrix as the agent moves from one state to the other. You will also notice a reward value of 1 when the agent is in state 15.

To summarize, we saw how reinforcement learning can be practically implemented using TensorFlow. If you found this post useful, do check out the book Deep Learning Essentials, which will help you fine-tune and optimize your deep learning models for better performance.

Is Mozilla the most progressive tech organization on the planet right now?

Richard Gall
16 Oct 2018
7 min read
2018, according to The Economist, has been the year of the techlash. Scandals, protests, resignations, congressional testimonies - many of the largest companies in the world have been in the proverbial dock for a distinct lack of accountability. Together, these stories have created a narrative where many are starting to question the benefits of unbridled innovation. But Mozilla is one company that seems to have bucked that trend. In recent weeks there have been a series of news stories that suggest Mozilla is a company thinking differently about its place in the world, as well as the wider challenges technology poses to society. All of these come together to present Mozilla in a new light. Cynics might suggest that much of this is little more than some smart PR work, but it's a little unfair to dismiss what is some genuinely impressive work. So much has been happening across the industry that deserves scepticism at best and opprobrium at worst. To see a tech company stand out from the tiresome pattern of stories this year can only be a good thing.

Mozilla on education: technology, ethical code, and the humanities

Code ethics has become a big topic of conversation in 2018. And rightly so - with innovation happening at an alarming pace, it has become easy to make the mistake of viewing technology as a replacement for human agency, rather than something that emerges from it. When we talk about code ethics it reminds us that technology is something built from the decisions and actions of thousands of different people. It's for this reason that last week's news that Mozilla has teamed up with a number of organizations, including the Omidyar Network, to announce a brand new prize for computer science students feels so important. At a time when the likes of Mark Zuckerberg dance around any notion of accountability, peddling a narrative where everything is just a little bit beyond Facebook's orbit of control, the 'Responsible Computer Science Challenge' stands out. With $3.5 million up for grabs for smart computer science students, it's evidence that Mozilla is putting its money where its mouth is and making ethical decision making something which, for once, actually pays.

Mitchell Baker on the humanities and technology

Mitchell Baker's comments to the Guardian that accompanied the news also demonstrate a refreshingly honest perspective from a tech leader. "One thing that's happened in 2018," Baker said, "is that we've looked at the platforms, and the thinking behind the platforms, and the lack of focus on impact or result. It crystallised for me that if we have STEM education without the humanities, or without ethics, or without understanding human behaviour, then we are intentionally building the next generation of technologists who have not even the framework or the education or vocabulary to think about the relationship of STEM to society or humans or life." Baker isn't, however, a crypto-luddite or an elitist who wants full stack developer classicists. Instead she's looking forward at the ways in which different disciplines can interact and inform one another. It's arguably an intellectual form of DevOps. It is a way of bridging the gap between STEM skills and practices, and those rooted in the tradition of the humanities. The significance of this intervention shouldn't be understated. It opens up a dialogue within society and the tech industry that might get us to a place where ethics is simply part and parcel of what it means to build and design software, not an optional extra.

Mozilla's approach to internal diversity: dropping meritocracy

The respective cultures of organizations and communities across tech have been in the spotlight over the last few months. Witness the bitter furore over Linux's change to its community guidelines to see just how important definitions and guidelines are to the people within them. That's why Mozilla's move to drop meritocracy from its guidelines of governance and leadership structures was a small yet significant move. It's simply another statement of intent from a company eager to ensure it helps develop a culture more open and inclusive than the tech world has been over the last three decades. In a post published on the Mozilla blog at the start of October, Emma Irwin (D&I Strategy, Mozilla Contributors and Communities) and Larissa Shapiro (Head of Global Diversity & Inclusion at Mozilla) wrote that "Meritocracy does not consider the reality that tech does not operate on a level playing field." The new governance proposal actually reflects Mozilla's apparent progressiveness pretty well. In it, it states that "the project also seeks to debias this system of distributing authority through active interventions that engage and encourage participation from diverse communities." While there has been some criticism of the change, it's important to note that the words used by organizations of this size do have an impact on how we frame and focus problems. From this perspective, Mozilla's decision could well be a vital small step in making tech more accessible and diverse.

The tech world needs to engage with political decision makers

Mozilla isn't just a 'progressive' tech company because of the content of its political beliefs. Instead, what's particularly important is how it appears to recognise that the problems that technology faces and engages with are, in fact, much bigger than technology itself. Just consider the actions of other tech leaders this year. Sundar Pichai didn't attend his congressional hearing, Jack Dorsey assured us that Twitter has safety at its heart while verifying neo-Nazis, while Mark Zuckerberg suggested that AI can fix the problems of election interference and fake news. The hubris has been staggering. Mozilla's leadership appears to be trying hard to avoid the same pitfalls. We shouldn't be surprised that Mozilla actually embraced the idea of 2018's 'techlash.' The organization used the term in the title of a post directed at G20 leaders in August. Written alongside The Internet Society and the Web Foundation, it urged global leaders to "reinject hope back into technological innovation." Implicit in the post is an acknowledgement that the aims and goals of much of the tech industry - improving people's lives, making infrastructure more efficient - can't be purely solved by the industry itself. It is a subtle stab at what might be considered hubris.

Taking on government and regulation

But this isn't to say Mozilla is completely in thrall to government and regulation. Most recently (16 October), Mozilla voiced its concerns about current decryption laws being debated in the Australian Parliament. The organization was clear, saying "this is at odds with the core principles of open source, user expectations, and potentially contractual license obligations." At the beginning of September Mozilla also spoke out against EU copyright reform. The organization argued that "article 13 will be used to restrict the freedom of expression and creative potential of independent artists who depend upon online services to directly reach their audience and bypass the rigidities and limitations of the commercial content industry." While opposition to EU copyright reform came from a range of voices - including those huge corporations that have come under scrutiny during the 'techlash' - Mozilla is, at least, consistent.

The key takeaway from Mozilla: let's learn the lessons of 2018's techlash

The techlash has undoubtedly caused a lot of pain for many this year. But the worst thing that could happen is for the tech industry to fail to learn the lessons that are emerging. Mozilla deserves credit for trying hard to properly understand the implications of what's been happening and develop a deliberate vision for how to move forward.

AI Now Institute releases Current State of AI 2018 Report

Natasha Mathur
07 Dec 2018
7 min read
The AI Now Institute, New York University, released its third annual report on the current state of AI, yesterday.  2018 AI Now Report focused on themes such as industry AI scandals, and rising inequality. It also assesses the gaps between AI ethics and meaningful accountability, as well as looks at the role of organizing and regulation in AI. Let’s have a look at key recommendations from the AI Now 2018 report. Key Takeaways Need for a sector-specific approach to AI governance and regulation This year’s report reflects on the need for stronger AI regulations by expanding the powers of sector-specific agencies (such as United States Federal Aviation Administration and the National Highway Traffic Safety Administration) to audit and monitor these technologies based on domains. Development of AI systems is rising and there aren’t adequate governance, oversight, or accountability regimes to make sure that these systems abide by the ethics of AI. The report states how general AI standards and certification models can’t meet the expertise requirements for different sectors such as health, education, welfare, etc, which is a key requirement for enhanced regulation. “We need a sector-specific approach that does not prioritize the technology but focuses on its application within a given domain”, reads the report. Need for tighter regulation of Facial recognition AI systems Concerns are growing over facial recognition technology as they’re causing privacy infringement, mass surveillance, racial discrimination, and other issues. As per the report, stringent regulation laws are needed that demands stronger oversight, public transparency, and clear limitations. Moreover, only providing public notice shouldn’t be the only criteria for companies to apply these technologies. There needs to be a “high threshold” for consent, keeping in mind the risks and dangers of mass surveillance technologies. The report highlights how “affect recognition”, a subclass of facial recognition that claims to be capable of detecting personality, inner feelings, mental health, etc, depending on images or video of faces, needs to get special attention, as it is unregulated. It states how these claims do not have sufficient evidence behind them and are being abused in unethical and irresponsible ways.“Linking affect recognition to hiring, access to insurance, education, and policing creates deeply concerning risks, at both an individual and societal level”, reads the report. It seems like progress is being made on this front, as it was just yesterday when Microsoft recommended that tech companies need to publish documents explaining the technology’s capabilities, limitations, and consequences in case their facial recognition systems get used in public. New approaches needed for governance in AI The report points out that internal governance structures at technology companies are not able to implement accountability effectively for AI systems. “Government regulation is an important component, but leading companies in the AI industry also need internal accountability structures that go beyond ethics guidelines”, reads the report.  This includes rank-and-file employee representation on the board of directors, external ethics advisory boards, along with independent monitoring and transparency efforts. 
Need to waive trade secrecy and other legal claims The report states that Vendors and developers creating AI and automated decision systems for use in government should agree to waive any trade secrecy or other legal claims that would restrict the public from full auditing and understanding of their software. As per the report, Corporate secrecy laws are a barrier as they make it hard to analyze bias, contest decisions, or remedy errors. Companies wanting to use these technologies in the public sector should demand the vendors to waive these claims before coming to an agreement. Companies should protect workers from raising ethical concerns It has become common for employees to organize and resist technology to promote accountability and ethical decision making. It is the responsibility of these tech companies to protect their workers’ ability to organize, whistleblow, and promote ethical choices regarding their projects. “This should include clear policies accommodating and protecting conscientious objectors, ensuring workers the right to know what they are working on, and the ability to abstain from such work without retaliation or retribution”, reads the report. Need for more in truth in advertising of AI products The report highlights that the hype around AI has led to a gap between marketing promises and actual product performance, causing risks to both individuals and commercial customers. As per the report, AI vendors should be held to high standards when it comes to them making promises, especially when there isn’t enough information on the consequences and the scientific evidence behind these promises. Need to address exclusion and discrimination within the workplace The report states that the Technology companies and the AI field focus on the “pipeline model,” that aims to train and hire more employees. However, it is important for tech companies to assess the deeper issues such as harassment on the basis of gender, race, etc, within workplaces. They should also examine the relationship between exclusionary cultures and the products they build, so to build tools that do not perpetuate bias and discrimination. Detailed account of the “full stack supply chain” As per the report, there is a need to better understand the parts of an AI system and the full supply chain on which it relies for better accountability. “This means it is important to account for the origins and use of training data, test data, models, the application program interfaces (APIs), and other components over a product lifecycle”, reads the paper. This process is called accounting for the ‘full stack supply chain’ of AI systems, which is necessary for a more responsible form of auditing. The full stack supply chain takes into consideration the true environmental and labor costs of AI systems. This includes energy use, labor use for content moderation and training data creation, and reliance on workers for maintenance of AI systems. More funding and support for litigation, and labor organizing on AI issues The report states that there is a need for increased support for legal redress and civic participation. This includes offering support to public advocates representing people who have been exempted from social services because of algorithmic decision making, civil society organizations and labor organizers who support the groups facing dangers of job loss and exploitation. 
Need for university AI programs to expand beyond the computer science discipline

The report states that there is a need for university programs and syllabi to expand their disciplinary orientation. This means the inclusion of social and humanistic disciplines within universities' AI programs. For AI efforts to truly make a social impact, it is necessary to train the faculty and students within computer science departments to research the social world. Many people have already started to implement this; for instance, Mitchell Baker, chairwoman and co-founder of Mozilla, talked about the need for the tech industry to expand beyond technical skills by bringing in the humanities. "Expanding the disciplinary orientation of AI research will ensure deeper attention to social contexts, and more focus on potential hazards when these systems are applied to human populations", reads the paper.

For more coverage, check out the official AI Now 2018 report.

Unity introduces guiding Principles for ethical AI to promote responsible use of AI
Teaching AI ethics – Trick or Treat?
Sex robots, artificial intelligence, and ethics: How desire shapes and is shaped by algorithms

How NeurIPS 2018 is taking on its diversity and inclusion challenges

Sugandha Lahoti
06 Dec 2018
3 min read
This year the Neural Information Processing Systems Conference is asking serious questions to improve diversity, equity, and inclusion at NeurIPS. “Our goal is to make the conference as welcoming as possible to all.” said the heads of the new diversity and inclusion chairs introduced this year. https://twitter.com/InclusionInML/status/1069987079285809152 The Diversity and Inclusion chairs were headed by Hal Daume III, a professor from the University of Maryland and machine learning and fairness groups researcher at Microsoft Research and Katherine Heller, assistant professor at Duke University and research scientist at Google Brain. They opened up the talk by acknowledging the respective privilege that they get as a group of white man and woman and the fact that they don’t reflect the diversity of experience in the conference room, much less the world. They talk about the three major goals with respect to inclusion at NeurIPS: Learn about the challenges that their colleagues have faced. Support those doing the hard work of amplifying the voices of those who have been historically excluded. To begin structural changes that will positively impact the community over the coming years. They urged attendees to start building an environment where everyone can do their best work. They want people to: see other perspectives remember the feeling of being an outsider listen, do research and learn. make an effort and speak up Concrete actions taken by the NeurIPS diversity and inclusion chairs This year they have assembled an advisory board and run a demographics and inclusion survey. They have also conducted events such as WIML (Women in Machine Learning), Black in AI, LatinX in AI, and Queer in AI. They have established childcare subsidies and other activities in collaboration with Google and DeepMind to support all families attending NeurIPS by offering a stipend of up to $100 USD per day. They have revised their Code of Conduct, to provide an experience for all participants that is free from harassment, bullying, discrimination, and retaliation. They have added inclusion tips on Twitter offering tips and bits of advice related to D&I efforts. The conference also offers pronoun stickers (only them and they), first-time attendee stickers, and information for participant needs. They have also made significant infrastructure improvements for visa handling. They had discussions with people handling visas on location, sent out early invitation letters for visas, and are choosing future locations with visa processing in mind. In the future, they are also looking to establish a legal team for details around Code of Conduct. Further, they are looking to improve institutional structural changes that support the community, and improve the coordination around affinity groups & workshops. For the first time, NeurIPS also invited a diversity and inclusion (D&I) speaker Laura Gomez to talk about the lack of diversity in the tech industry, which leads to biased algorithms, faulty products, and unethical tech. Head over to NeurIPS website for interesting tutorials, invited talks, product releases, demonstrations, presentations, and announcements. NeurIPS 2018: Deep learning experts discuss how to build adversarially robust machine learning models NeurIPS 2018 paper: DeepMind researchers explore autoregressive discrete autoencoders (ADAs) to model music in raw audio at scale NeurIPS 2018: A quick look at data visualization for Machine learning by Google PAIR researchers [Tutorial]

Popular Data sources and models in SAP Analytics Cloud

Kunal Chaudhari
03 Jan 2018
12 min read
[box type="note" align="" class="" width=""]This article is an excerpt from a book written by Riaz Ahmed titled Learning SAP Analytics Cloud.This book deals with the basics of SAP Analytics Cloud (formerly known as SAP BusinessObjects Cloud) and unveil significant features for a beginner.[/box] Our article provides a brief overview of the different data sources and models, available in SAP Analytics Cloud. A model is the foundation of every analysis you create to evaluate the performance of your organization. It is a high-level design that exposes the analytic requirements of end users. Planning and analytics are the two types of models you can create in SAP Analytics Cloud. Analytics models are simpler and more flexible, while planning models are full-featured models in which you work with planning features. Preconfigured with dimensions for time and categories, planning models support multi-currency and security features at both model and dimension levels.   To determine what content to include in your model, you must first identify the columns from the source data on which users need to query. The columns you need in your model reside in some sort of data source. SAP Analytics Cloud supports three types of data sources: files (such as CSV or Excel files) that usually reside on your computer, live data connections from a connected remote system, and cloud apps. In addition to the files on your computer, you can use on-premise data sources, such as SAP Business Warehouse, SAP ERP, SAP Universe, SQL database, and more, to acquire data for your models. In the cloud, you can get data from apps such as Concur, Google Drive, SAP Business ByDesign, SAP Hybris Cloud, OData Services, and Success Factors. The following figure depicts these data sources. The cloud app data sources you can use with SAP Analytics Cloud are displayed above the firewall mark, while those in your local network are shown under the firewall. As you can see in the following figure, there are over twenty data sources currently supported by SAP Analytics Cloud. The methods of connecting to these data sources also vary from each other. However, some instances provided in this article would give you an idea on how connections are established to acquire data. The connection methods provided here relate to on-premise and cloud app data sources. Create a direct live connection to SAP HANA Execute the following steps to connect to the on-premise SAP HANA system to use live data in SAP Analytics Cloud. Live data means that you can get up-to-the-minute data when you open a story in SAP Analytics Cloud. In this case, any changes made to the data in the source system are reflected immediately. Usually, there are two ways to establish a connection to a data source--use the Connection option from the main menu, or specify the data source during the process of creating a model. However, live data connections must be established via the Connection menu option prior to creating the corresponding model. Here are the steps: From the main menu, select Connection. On the Connections page, click on the Add Connection icon (+), and select Live Data Connection | SAP HANA. In the New Live Connection dialog, enter a name for the connection (for example, HANA). From the Connection Type drop-down list, select Direct. The Direct option is used when you connect to a data source that resides inside your corporate network. The Path option requires a reverse proxy to the HANA XS server. 
The SAP Cloud Platform and Cloud options in this list are used when you are connecting to SAP cloud environments. When you select the Direct option, the System Type is set to HANA and the protocol is set to HTTPS. Enter the hostname and port number in respective text boxes. The Authentication Method list contains two options: User Name and Password and SAML Single Sign On. The SAML Single Sign On option requires that the SAP HANA system is already configured to use SAML authentication. If not, choose the User Name and Password option and enter these credentials in relevant boxes. Click on OK to finish the process. A new connection will appear on the Connection page, which can now be used as a data source for models. However, in order to complete this exercise, we will go through a short demo of this process here. From the main menu, go to Create | Model. On the New Model page, select Use a datasource. From the list that appears on your right side, select Live Data connection. In the dialog that is displayed, select the HANA connection you created in the previous steps from the System list. From the Data Source list, select the HANA view you want to work with. The list of views may be very long, and a search feature is available to help you locate the source you are looking for. Finally, enter the name and the optional description for the new model, and click on OK. The model will be created, and its definitions will appear on another page. Connecting remote systems to import data In addition to creating live connections, you can also create connections that allow you to import data into SAP Analytics Cloud. In these types of connections that you make to access remote systems, data is imported (copied) to SAP Analytics Cloud. Any changes users make in the source data do not affect the imported data. To establish connections with these remote systems, you need to install some additional components. For example, you must install SAP HANA Cloud connector to access SAP Business Planning and Consolidation (BPC) for Netweaver . Similarly, SAP Analytics Cloud agent should be installed for SAP Business Warehouse (BW), SQL Server, SAP ERP, and others. Take a look at the connection figure illustrated on a previous page. The following set of steps provide instructions to connect to SAP ERP. You can either connect to this system from the Connection menu or establish the connection while creating a model. In these steps, we will adopt the latter approach. From the main menu, go to Create | Model. 2. Click on the Use a datasource option on the choose how you'd like to start your model page. 3. From the list of available datasources to your right, select SAP ERP. 4. From the Connection Name list, select Create New Connection. 5. Enter a name for the connection (for example, ERP) in the Connection Name box. You can also provide a       description to further elaborate the new connection. 6. For Server Type, select Application Server and enter values for System,   System Number, Client ID, System ID, Language, User Name, and Password. Click the Create button after providing this information. 7. Next, you need to create a query based on the SAP ERP system data. Enter  a name for the query, for example, sales. 8. In the same dialog, expand the ERP object where the data exists. Locate and select the object, and then choose the data columns you want to include in your model. You are provided with a preview of the data before importing. On the preview window, click on Done to start the import process. 
The imported data will appear on the Data Integration page, which is the initial screen in the model creation segment. Connect Google Drive to import data You went through two scenarios in which you saw how data can be fetched. In the first scenario, you created a live connection to create a model on live data, while in the second one, you learned how to import data from remote systems. In this article, you will be guided to create a model using a cloud app called Google Drive. Google Drive is a file storage and synchronization service developed by Google. It allows users to store files in the cloud, synchronize files across devices, and share files. Here are the steps to use the data stored on Google Drive: From the main menu, go to Create | Model. On the choose how you'd like to start your model page, select Get data from an app. From the available apps to your right, select Google Drive.  In the Import Model From Google Drive dialog, click on the Select Data button.  If you are not already logged into Google Drive, you will be prompted to log in.  Another dialog appears displaying a list of compatible files. Choose a file, and click on the Select button. You are brought back to the Import Model From Google Drive dialog, where you have to enter a model name and an optional description. After providing this information, click on the Import button. The import process will start, and after a while, you will see the Data Integration screen populated with the data from the selected Google Drive file. Refreshing imported data SAP Analytics Cloud allows you to refresh your imported data. With this option, you can re-import the data on demand to get the latest values. You can perform this refresh operation either manually or create an import schedule to refresh the data at a specific date and time or on a recurring basis. The following data sources support scheduling: SAP Business Planning and Consolidation (BPC) SAP Business Warehouse (BW) Concur OData services An SAP Analytics BI platform universe (UNX) query SAP ERP Central Component (SAP ECC) SuccessFactors [DC3] HCM suite Excel and comma-separated values (CSV) files imported from a file server (not imported from your local machine) SQL databases You can adopt the following method to access the schedule settings for a model: Select Connection from the main menu. The Connection page appears. The Schedule Status tab on this page lists all updates and import jobs associated with any data source. Alternatively, go to main menu | Browse | Models. The Models page appears. The updatable model on the list will have a number of data sources shown in the Datasources column. In the Datasources column, click on the View More link. The update and import jobs associated with this data source will appear. The Update Model and Import Data job are the two types of jobs that are run either immediately or on a schedule. To run an Import Data job immediately, choose Import Data in the Action column. If you want to run an Update Model job, select a job to open it. The following refreshing methods specify how you want existing data to be handled. The Import Data jobs are listed here: Update: Selecting this option updates the existing data and adds new entries to the target model. Clean and Replace: Any existing data is wiped out and new entries are added to the target model. Append: Nothing is done with the existing data. Only new entries are added to the target model. 
The Update Model jobs are listed here: Clean and Replace: This deletes the existing data and adds new entries to the target model. Append: This keeps the existing data as is and adds new entries to the target model. The Schedule Settings option allows you to select one of the following schedule options: None: The import is performed immediately Once: The import is performed only once at a scheduled time Repeating: The import is executed according to a repeating pattern; you can select a start and end date and time as well as a recurrence pattern After setting your preferences, click on the Save icon to save your scheduling settings. If you chose the None option for scheduling, select Update Model or Import Data to run the update or import job now. Once a scheduled job completes, its result appears on the Schedule Status tab displaying any errors or warnings. If you see such daunting messages, select the job to see the details. Expand an entry in the Refresh Manager panel to get more information about the scary stuff. If the import process rejected any rows in the dataset, you are provided with an option to download the rejected rows as a CSV file for offline examination. Fix the data in the source system, or fix the error in the downloaded CSV file and upload data from it. After creating your models, you access them via the main menu | Browse | Models path. The Models page, as illustrated in the following figure, is the main interface where you manage your models. All existing models are listed under the Models tab. You can open a model by clicking on its name. Public dimensions are saved separately from models and appear on the Public Dimensions tab. When you create a new model or modify an existing model, you can add these public dimensions. If you are using multiple currencies in your data, the exchange rates are maintained in separate tables. These are saved independently of any model and are listed on the Currency Conversion tab. Data for geographic locations, which are displayed and used in your data analysis, is maintained on the Points of Interest tab. The toolbar provided under the four tabs carries icons to perform common operations for managing models. Click on the New Model icon to create a new model. Select a model by placing a check mark (A) in front of it. Then click on the Copy Selected Model icon to make an exact copy of the selected model. Use the delete icon to remove the selected models. The Clear Selected Model option removes all the data from the selected model. The list of data import options that are supported is available from a menu beneath the Import Data icon on the toolbar. You can export a model to a .csv file once or on a recurring schedule using Export Model As File. SAP Analytics Cloud can help transform how you discover, plan, predict, collaborate, visualize, and extend all in one solution. In addition to on-premise data sources, you can fetch data from a variety of other cloud apps and even from Excel and text files to build your data models and then create stories based on these models. If you enjoyed this excerpt, check out the book Learning SAP Analytics Cloud to know more about professional data analysis using different types of charts, tables, geo maps, and more with SAP Analytics Cloud.    
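To make the difference between these refresh methods easier to picture, here is a small, purely illustrative Python sketch (it does not use any SAP Analytics Cloud API) that mimics how each mode treats existing rows versus newly imported rows, using a dictionary keyed by record ID.

# Toy model of the three refresh behaviours described above.
# "existing" is the data already in the target model; "incoming" is the new import.

def update(existing, incoming):
    """Update matching entries and add new ones (the Update option)."""
    merged = dict(existing)
    merged.update(incoming)
    return merged

def clean_and_replace(existing, incoming):
    """Wipe existing data and keep only the new entries."""
    return dict(incoming)

def append(existing, incoming):
    """Keep existing data untouched; only add entries that are new."""
    merged = dict(existing)
    for key, row in incoming.items():
        if key not in merged:
            merged[key] = row
    return merged

existing = {1: {"region": "EMEA", "sales": 100}, 2: {"region": "APAC", "sales": 80}}
incoming = {2: {"region": "APAC", "sales": 95}, 3: {"region": "AMER", "sales": 60}}

print(update(existing, incoming))             # 1 kept, 2 updated, 3 added
print(clean_and_replace(existing, incoming))  # only 2 and 3 remain
print(append(existing, incoming))             # 1 and 2 unchanged, 3 added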

Splunk: How to work with multiple indexes [Tutorial]

Pravin Dhandre
20 Jun 2018
12 min read
An index in Splunk is a storage pool for events, capped by size and time. By default, all events will go to the index specified by defaultDatabase, which is called main but lives in a directory called defaultdb. In this tutorial, we put focus to index structures, need of multiple indexes, how to size an index and how to manage multiple indexes in a Splunk environment. This article is an excerpt from a book written by James D. Miller titled Implementing Splunk 7 - Third Edition. Directory structure of an index Each index occupies a set of directories on the disk. By default, these directories live in $SPLUNK_DB, which, by default, is located in $SPLUNK_HOME/var/lib/splunk. Look at the following stanza for the main index: [main] homePath = $SPLUNK_DB/defaultdb/db coldPath = $SPLUNK_DB/defaultdb/colddb thawedPath = $SPLUNK_DB/defaultdb/thaweddb maxHotIdleSecs = 86400 maxHotBuckets = 10 maxDataSize = auto_high_volume If our Splunk installation lives at /opt/splunk, the index main is rooted at the path /opt/splunk/var/lib/splunk/defaultdb. To change your storage location, either modify the value of SPLUNK_DB in $SPLUNK_HOME/etc/splunk-launch.conf or set absolute paths in indexes.conf. splunk-launch.conf cannot be controlled from an app, which means it is easy to forget when adding indexers. For this reason, and for legibility, I would recommend using absolute paths in indexes.conf. The homePath directories contain index-level metadata, hot buckets, and warm buckets. coldPath contains cold buckets, which are simply warm buckets that have aged out. See the upcoming sections The lifecycle of a bucket and Sizing an index for details. When to create more indexes There are several reasons for creating additional indexes. If your needs do not meet one of these requirements, there is no need to create more indexes. In fact, multiple indexes may actually hurt performance if a single query needs to open multiple indexes. Testing data If you do not have a test environment, you can use test indexes for staging new data. This then allows you to easily recover from mistakes by dropping the test index. Since Splunk will run on a desktop, it is probably best to test new configurations locally, if possible. Differing longevity It may be the case that you need more history for some source types than others. The classic example here is security logs, as compared to web access logs. You may need to keep security logs for a year or more, but need the web access logs for only a couple of weeks. If these two source types are left in the same index, security events will be stored in the same buckets as web access logs and will age out together. To split these events up, you need to perform the following steps: Create a new index called security, for instance Define different settings for the security index Update inputs.conf to use the new index for security source types For one year, you might make an indexes.conf setting such as this: [security] homePath = $SPLUNK_DB/security/db coldPath = $SPLUNK_DB/security/colddb thawedPath = $SPLUNK_DB/security/thaweddb #one year in seconds frozenTimePeriodInSecs = 31536000 For extra protection, you should also set maxTotalDataSizeMB, and possibly coldToFrozenDir. If you have multiple indexes that should age together, or if you will split homePath and coldPath across devices, you should use volumes. See the upcoming section, Using volumes to manage multiple indexes, for more information. 
Then, in inputs.conf, you simply need to add an index to the appropriate stanza as follows: [monitor:///path/to/security/logs/logins.log] sourcetype=logins index=security Differing permissions If some data should only be seen by a specific set of users, the most effective way to limit access is to place this data in a different index, and then limit access to that index by using a role. The steps to accomplish this are essentially as follows: Define the new index. Configure inputs.conf or transforms.conf to send these events to the new index. Ensure that the user role does not have access to the new index. Create a new role that has access to the new index. Add specific users to this new role. If you are using LDAP authentication, you will need to map the role to an LDAP group and add users to that LDAP group. To route very specific events to this new index, assuming you created an index called sensitive, you can create a transform as follows: [contains_password] REGEX = (?i)password[=:] DEST_KEY = _MetaData:Index FORMAT = sensitive You would then wire this transform to a particular sourcetype or source index in props.conf. Using more indexes to increase performance Placing different source types in different indexes can help increase performance if those source types are not queried together. The disks will spend less time seeking when accessing the source type in question. If you have access to multiple storage devices, placing indexes on different devices can help increase the performance even more by taking advantage of different hardware for different queries. Likewise, placing homePath and coldPath on different devices can help performance. However, if you regularly run queries that use multiple source types, splitting those source types across indexes may actually hurt performance. For example, let's imagine you have two source types called web_access and web_error. We have the following line in web_access: 2012-10-19 12:53:20 code=500 session=abcdefg url=/path/to/app And we have the following line in web_error: 2012-10-19 12:53:20 session=abcdefg class=LoginClass If we want to combine these results, we could run a query like the following: (sourcetype=web_access code=500) OR sourcetype=web_error | transaction maxspan=2s session | top url class If web_access and web_error are stored in different indexes, this query will need to access twice as many buckets and will essentially take twice as long. The life cycle of a bucket An index is made up of buckets, which go through a specific life cycle. Each bucket contains events from a particular period of time. The stages of this lifecycle are hot, warm, cold, frozen, and thawed. The only practical difference between hot and other buckets is that a hot bucket is being written to, and has not necessarily been optimized. These stages live in different places on the disk and are controlled by different settings in indexes.conf: homePath contains as many hot buckets as the integer value of maxHotBuckets, and as many warm buckets as the integer value of maxWarmDBCount. When a hot bucket rolls, it becomes a warm bucket. When there are too many warm buckets, the oldest warm bucket becomes a cold bucket. Do not set maxHotBuckets too low. If your data is not parsing perfectly, dates that parse incorrectly will produce buckets with very large time spans. As more buckets are created, these buckets will overlap, which means all buckets will have to be queried every time, and performance will suffer dramatically. A value of five or more is safe. 
coldPath contains cold buckets, which are warm buckets that have rolled out of homePath once there are more warm buckets than the value of maxWarmDBCount. If coldPath is on the same device, only a move is required; otherwise, a copy is required. Once the values of frozenTimePeriodInSecs, maxTotalDataSizeMB, or maxVolumeDataSizeMB are reached, the oldest bucket will be frozen. By default, frozen means deleted. You can change this behavior by specifying either of the following:

coldToFrozenDir: This lets you specify a location to move the buckets once they have aged out. The index files will be deleted, and only the compressed raw data will be kept. This essentially cuts the disk usage by half. This location is unmanaged, so it is up to you to watch your disk usage.
coldToFrozenScript: This lets you specify a script to perform some action when the bucket is frozen. The script is handed the path to the bucket that is about to be frozen.

thawedPath can contain buckets that have been restored. These buckets are not managed by Splunk and are not included in all time searches. To search these buckets, their time range must be included explicitly in your search. I have never actually used this directory. Search https://splunk.com for restore archived to learn the procedures.

Sizing an index

To estimate how much disk space is needed for an index, use the following formula:

(gigabytes per day) * .5 * (days of retention desired)

Likewise, to determine how many days you can store an index, the formula is essentially:

(device size in gigabytes) / ( (gigabytes per day) * .5 )

The .5 represents a conservative compression ratio. The log data itself is usually compressed to 10 percent of its original size. The index files necessary to speed up search bring the size of a bucket closer to 50 percent of the original size, though it is usually smaller than this. If you plan to split your buckets across devices, the math gets more complicated unless you use volumes. Without using volumes, the math is as follows:

homePath = (maxWarmDBCount + maxHotBuckets) * maxDataSize
coldPath = maxTotalDataSizeMB - homePath

For example, say we are given these settings:

[myindex]
homePath = /splunkdata_home/myindex/db
coldPath = /splunkdata_cold/myindex/colddb
thawedPath = /splunkdata_cold/myindex/thaweddb
maxWarmDBCount = 50
maxHotBuckets = 6
maxDataSize = auto_high_volume #10GB on 64-bit systems
maxTotalDataSizeMB = 2000000

Filling in the preceding formula, we get these values:

homePath = (50 warm + 6 hot) * 10240 MB = 573440 MB
coldPath = 2000000 MB - homePath = 1426560 MB

If we use volumes, this gets simpler and we can simply set the volume sizes to our available space and let Splunk do the math.

Using volumes to manage multiple indexes

Volumes combine pools of storage across different indexes so that they age out together. Let's make up a scenario where we have five indexes and three storage devices. The indexes are as follows:

Name         Data per day   Retention required   Storage needed
web          50 GB          no requirement       ?
security     1 GB           2 years              730 GB * 50 percent
app          10 GB          no requirement       ?
chat         2 GB           2 years              1,460 GB * 50 percent
web_summary  1 GB           1 year               365 GB * 50 percent

Now let's say we have three storage devices to work with, mentioned in the following table:

Name        Size
small_fast  500 GB
big_fast    1,000 GB
big_slow    5,000 GB

We can create volumes based on the retention time needed. Security and chat share the same retention requirements, so we can place them in the same volumes.
We want our hot buckets on our fast devices, so let's start there with the following configuration: [volume:two_year_home] #security and chat home storage path = /small_fast/two_year_home maxVolumeDataSizeMB = 300000 [volume:one_year_home] #web_summary home storage path = /small_fast/one_year_home maxVolumeDataSizeMB = 150000 For the rest of the space needed by these indexes, we will create companion volume definitions on big_slow, as follows: [volume:two_year_cold] #security and chat cold storage path = /big_slow/two_year_cold maxVolumeDataSizeMB = 850000 #([security]+[chat])*1024 - 300000 [volume:one_year_cold] #web_summary cold storage path = /big_slow/one_year_cold maxVolumeDataSizeMB = 230000 #[web_summary]*1024 - 150000 Now for our remaining indexes, whose timeframe is not important, we will use big_fast and the remainder of big_slow, like so: [volume:large_home] #web and app home storage path = /big_fast/large_home maxVolumeDataSizeMB = 900000 #leaving 10% for pad [volume:large_cold] #web and app cold storage path = /big_slow/large_cold maxVolumeDataSizeMB = 3700000 #(big_slow - two_year_cold - one_year_cold)*.9 Given that the sum of large_home and large_cold is 4,600,000 MB, and a combined daily volume of web and app is 60,000 MB approximately, we should retain approximately 153 days of web and app logs with 50 percent compression. In reality, the number of days retained will probably be larger. With our volumes defined, we now have to reference them in our index definitions: [web] homePath = volume:large_home/web coldPath = volume:large_cold/web thawedPath = /big_slow/thawed/web [security] homePath = volume:two_year_home/security coldPath = volume:two_year_cold/security thawedPath = /big_slow/thawed/security coldToFrozenDir = /big_slow/frozen/security [app] homePath = volume:large_home/app coldPath = volume:large_cold/app thawedPath = /big_slow/thawed/app [chat] homePath = volume:two_year_home/chat coldPath = volume:two_year_cold/chat thawedPath = /big_slow/thawed/chat coldToFrozenDir = /big_slow/frozen/chat [web_summary] homePath = volume:one_year_home/web_summary coldPath = volume:one_year_cold/web_summary thawedPath = /big_slow/thawed/web_summary thawedPath cannot be defined using a volume and must be specified for Splunk to start. For extra protection, we specified coldToFrozenDir for the indexes' security and chat. The buckets for these indexes will be copied to this directory before deletion, but it is up to us to make sure that the disk does not fill up. If we allow the disk to fill up, Splunk will stop indexing until space is made available. This is just one approach to using volumes. You could overlap in any way that makes sense to you, as long as you understand that the oldest bucket in a volume will be frozen first, no matter what index put the bucket in that volume. With this, we learned to operate multiple indexes and how we can get effective business intelligence out of the data without hurting system performance. If you found this tutorial useful, do check out the book Implementing Splunk 7 - Third Edition and start creating advanced Splunk dashboards. Splunk leverages AI in its monitoring tools Splunk’s Input Methods and Data Feeds Splunk Industrial Asset Intelligence (Splunk IAI) targets Industrial IoT marketplace
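Circling back to the Sizing an index section, the two rules of thumb are easy to script when you are sketching capacity plans. The following short Python example is purely illustrative (it is not a Splunk tool or API); it applies the formulas from this article, including the conservative 0.5 compression assumption, to the figures used in the scenario above.

# Back-of-the-envelope Splunk index sizing, using the article's rules of thumb:
#   disk needed (GB) = GB per day * 0.5 * days of retention
#   days retained    = device size (GB) / (GB per day * 0.5)
COMPRESSION = 0.5  # conservative ratio assumed in the article

def disk_needed_gb(gb_per_day, retention_days):
    return gb_per_day * COMPRESSION * retention_days

def days_retained(device_gb, gb_per_day):
    return device_gb / (gb_per_day * COMPRESSION)

# Indexes with explicit retention requirements (GB/day, days), from the scenario
for name, gb_per_day, days in [("security", 1, 730),
                               ("chat", 2, 730),
                               ("web_summary", 1, 365)]:
    print(f"{name:12s} ~{disk_needed_gb(gb_per_day, days):.0f} GB")

# web + app share 4,600,000 MB of volume space at roughly 60 GB/day combined
# (treating 1 GB as 1,000 MB, as the article's estimate does)
print(f"web+app retention: ~{days_retained(4_600_000 / 1000, 60):.0f} days")

Treat the output as a starting point only; real compression ratios vary considerably by sourcetype.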

NeurIPS 2018: Developments in machine learning through the lens of Counterfactual Inference [Tutorial]

Savia Lobo
15 Dec 2018
7 min read
The 32nd NeurIPS Conference kicked off on the 2nd of December and continued till the 8th of December in Montreal, Canada. This conference covered tutorials, invited talks, product releases, demonstrations, presentations, and announcements related to machine learning research. “Counterfactual Inference” is one such tutorial presented during the NeurIPS by Susan Athey, The Economics of Technology Professor at the Stanford Graduate School of Business. This tutorial reviewed the literature that brings together recent developments in machine learning with methods for counterfactual inference. It will focus on problems where the goal is to estimate the magnitude of causal effects, as well as to quantify the researcher’s uncertainty about these magnitudes. She starts by mentioning that there are two sets of issues make causal inference must know concepts for AI. Some gaps between what we are doing in our research, and what the firms are applying. There are success stories such as Google images and so on. However, the top tech companies also do not fully adopt all the machine learning / AI concepts fully. If a firm dumps their old simple regression credit scoring model and makes use of a black box based on ML, are they going to worry what’s going to happen when they use the Black Box algorithm? According to Susan, the reason why firms and economists historically use simple models is that just by looking at the data it is difficult to understand whether the approach used is right. Whereas, using a Black box algorithm imparts some of the properties such as Interpretability, which helps in reasoning about the correctness of the approach. This helps researchers to make improvements in the model. Secondly, stability and robustness are also important for applications. Transfer learning helps estimate the model in one setting and use the same learning in some other setting. Also, these models will show fairness as many aspects of discrimination relates to correlation vs. causation. Finally, machine learning imparts a Human-like AI behavior that gives them the ability to make reasonable and never seen before decisions. All of these desired properties can be obtained in a causal model. The Causal Inference Framework In this framework, the goal is to learn a model of how the world works. For example, what happens to a body while a drug enters. Impact of intervention can be context specific. If a user learns something in a particular setting but it isn't working well in the other setting, it is not a problem with the framework. It’s, however, hard to do causal inference, there are some challenges including: We do not have the right kind of variation in the data. Lack of quasi-experimental data for estimation Unobserved contexts/confounders or insufficient data to control for observed confounders Analyst’s lack of knowledge about model Prof. Athey explains the true AI algorithm by using an example of contextual bandit under which there might be different treatments. In this example, one can select among alternative choices. They must have an explicit or implicit model of payoffs from alternatives. They also learn from past data. Here, the initial stages of learning have limited data, where there is a statistician inside the AI which performs counterfactual reasoning. A statistician should use best performing techniques (efficiency, bias). 
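To make the idea of a "statistician inside the AI" concrete, here is a small sketch of the kind of counterfactual reasoning a contextual bandit system depends on: estimating how a candidate policy would have performed using only logged data, via inverse propensity scoring. The logged data and the toy policies below are fabricated purely for illustration and are not from the talk.

import numpy as np

# Logged bandit data: for each round we stored the context, the action the
# deployed policy took, the probability with which it took it (the propensity),
# and the observed reward. This toy log is made up for illustration.
rng = np.random.default_rng(0)
n, n_actions = 10_000, 3
contexts = rng.normal(size=(n, 2))
propensities = np.full(n, 1.0 / n_actions)   # old policy: uniform random
actions = rng.integers(0, n_actions, size=n)
rewards = (actions == (contexts[:, 0] > 0).astype(int)).astype(float)

def new_policy(context):
    """Candidate policy we want to evaluate without deploying it."""
    return 1 if context[0] > 0 else 0

# Inverse propensity scoring (IPS): reweight logged rewards by how much more
# (or less) often the new policy would have chosen the logged action.
chosen = np.array([new_policy(c) for c in contexts])
match = (chosen == actions).astype(float)
ips_estimate = np.mean(match * rewards / propensities)

print(f"Estimated value of the new policy: {ips_estimate:.3f}")

Because the logging policy's action probabilities (the propensities) are recorded, the reweighting corrects for the fact that the new policy would have chosen different actions than the ones actually observed.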
Counterfactual Inference Approaches Approach 1: Program Evaluation or Treatment Effect Estimation The goal of this approach is to estimate the impact of an intervention or treatment assignment policies. This literature focuses mainly on low dimensional interventions. Here, the estimands or the things that people want to learn is the average effect (Did it work?). For more sophisticated projects, people seek the heterogeneous effect (For whom did it work?) and optimal policy (policy mapping of people’s behavior to their assignments). The main goal here is to set confidence intervals around these effects to avoid bias or noisy sampling. This literature focuses on design that enables identification and estimation of these effects without using randomized experiments. Some of the designs include Regression discontinuity, difference-in-difference, and so on. Approach 2: Structural Estimation or ‘Generative models and counterfactuals’ Here the goal is to impact on welfare/profits of participants in alternative counterfactual regimes. These regimes may not have ever been observed in relevant contexts. These also need a behavioral model of participants. One can make use of Dynamic structural models to learn about value function from agent choices in different states. Approach 3: Causal discovery The goal of this approach is to uncover the causal structure of a system. Here the analyst believes that there is an underlying structure where some variables are causes of others, e.g. a physical stimulus leads to biological responses. Application of this can be found in understanding software systems and biological systems. [box type="shadow" align="" class="" width=""]Recent literature brings causal reasoning, statistical theory, and modern machine learning algorithms together to solve important problems. The difference between supervised learning and causal inference is that supervised learning can evaluate in a test set in a model‐free way. In causal inference, parameter estimation is not observed in a test set. Also, it requires theoretical assumptions and domain knowledge. [/box] Estimating ATE (Average Treatment Effects) under unconfoundedness Here only the observational data is available and only an analyst has access to the data that is sufficient for the part of the information used to assign units to treatments that is related to potential outcomes. The speaker here has used an example of how online Ads are targeted using cookies. The user sees car ads because the advertiser knows that the user has visited car reviewer websites. Here the purchases cannot be related to users who saw an ad versus the ones who did not. Hence, the interest in cars is the unobserved confounder. However, the analyst can see the history of the websites visited by the user. This is the main source of information for the advertiser about user interests. Using Supervised ML to measure estimate ATE under unconfoundedness The first supervised ML method is propensity score weighting or KNN on propensity score. For instance, make use of the LASSO regression model to estimate the propensity score. The second method is Regression adjustment which tries to estimate the further outcomes or access the features of further outcomes to get a causal effect. The next method is estimating CATE (Conditional average treatment effect) and take averages using the BART model. The method mentioned by Prof. Athey here is, Double robust/ double machine learning which uses cross-fitted augmented inverse propensity scores. 
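For readers who want to see roughly what the doubly robust idea looks like in code, below is a compact sketch of an augmented inverse propensity weighted (AIPW) estimate of the average treatment effect on simulated data. It is a simplified illustration, not Prof. Athey's implementation: it omits the cross-fitting and confidence intervals she emphasises, and the simulated data, model choices, and scikit-learn estimators are assumptions made here.

import numpy as np
from sklearn.linear_model import LogisticRegression, LinearRegression

rng = np.random.default_rng(1)

# Simulated observational data with confounding: X drives both treatment and outcome.
n = 20_000
X = rng.normal(size=(n, 3))
p_treat = 1 / (1 + np.exp(-X[:, 0]))          # true propensity depends on X
W = rng.binomial(1, p_treat)                   # treatment assignment
tau = 2.0                                      # true average treatment effect
Y = X @ np.array([1.0, -0.5, 0.25]) + tau * W + rng.normal(size=n)

# Nuisance models: propensity e(x) and outcome regressions mu_0(x), mu_1(x).
e_hat = LogisticRegression().fit(X, W).predict_proba(X)[:, 1]
mu1 = LinearRegression().fit(X[W == 1], Y[W == 1]).predict(X)
mu0 = LinearRegression().fit(X[W == 0], Y[W == 0]).predict(X)

# AIPW / doubly robust score: consistent if either nuisance model is right.
psi = (mu1 - mu0
       + W * (Y - mu1) / e_hat
       - (1 - W) * (Y - mu0) / (1 - e_hat))
print(f"AIPW ATE estimate: {psi.mean():.2f} (true effect is {tau})")

The estimator combines the outcome-regression and propensity-weighting corrections, so it stays consistent if either of the two nuisance models is well specified.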
Another method she mentioned was Residual Balancing, which avoids assuming a sparse model, thus allowing applications with complex assignment. If unconfoundedness fails, the alternative assumption is that there exists an instrumental variable Zi that is correlated with Wi ("relevance") and that affects the outcome only through its effect on Wi (the exclusion restriction).

Structural Models

Structural models enable counterfactuals for never-seen worlds. Combining machine learning with structural models brings attention to identification and estimation using "good" exogenous variation in the data. Also, adding a sensible structure improves the performance required for never-seen counterfactuals and increases efficiency for sparse data (e.g. longitudinal data). The nature of the structure includes:

Learning underlying preferences that generalize to new situations
Incorporating the nature of the choice problem
Many domains have established setups that perform well in data-poor environments

With the help of Discrete Choice Models, users can evaluate the impact of a new product introduction or the removal of a product from the choice set. On combining these Discrete Choice Models with ML, we have two approaches to product interactions:

Use information about product categories, and assume products are substitutes within categories
Do not use available information about categories; estimate substitutes/complements

Susan concluded by mentioning some of the challenges of causal inference, which include data sufficiency and finding sufficient/useful variation in historical data. She also mentioned that recent advances in computational methods in ML don't help with this. However, tech firms conducting lots of experiments, running bandits, and interacting with humans at large scale can greatly expand the ability to learn about causal effects!

Head over to Susan Athey's entire tutorial on Counterfactual Inference on the NeurIPS Facebook page.

Researchers unveil a new algorithm that allows analyzing high-dimensional data sets more effectively, at NeurIPS conference
Accountability and algorithmic bias: Why diversity and inclusion matters [NeurIPS Invited Talk]
NeurIPS 2018: A quick look at data visualization for Machine learning by Google PAIR researchers [Tutorial]

4 Encryption options for your SQL Server

Vijin Boricha
16 Apr 2018
7 min read
In today's tutorial, we will learn about cryptographic elements like T-SQL functions, the Service Master Key, and more.

SQL Server cryptographic elements

Encryption is the process of obfuscating data by the use of a key or password. This can make the data useless without the corresponding decryption key or password. Encryption does not solve access control problems. However, it enhances security by limiting data loss even if access controls are bypassed. For example, if the database host computer is misconfigured and a hacker obtains sensitive data, that stolen information might be useless if it is encrypted.

SQL Server provides the following building blocks for encryption; based on them you can implement all supported features, such as backup encryption, Transparent Data Encryption, column encryption, and so on. We already know what symmetric and asymmetric keys are; the basic concept is the same in the SQL Server implementation. Later in the chapter you will practice how to create and implement all elements from Figure 9-3. Let me explain the rest of the items.

T-SQL functions

SQL Server has built-in support for handling encryption elements and features in the form of T-SQL functions. You don't need any third-party software to do that, as you do with other database platforms.

Certificates

A public key certificate is a digitally-signed statement that connects the data of a public key to the identity of the person, device, or service that holds the private key. Certificates are issued and signed by a certification authority (CA). You can work with self-signed certificates, but you should be careful here: they can be misused for a large set of network attacks.

SQL Server encrypts data with a hierarchical encryption. Each layer encrypts the layer beneath it using certificates, asymmetric keys, and symmetric keys. In a nutshell, this hierarchy means that any key is guarded (encrypted) with the key above it. In practice, if you miss just one element from the chain, decryption will be impossible. This is an important security feature, because it is really hard for an attacker to compromise all levels of security. Let me explain the most important elements in the hierarchy.

Service Master Key

SQL Server has two primary applications for keys: a Service Master Key (SMK) generated on and for a SQL Server instance, and a database master key (DMK) used for a database. The SMK is automatically generated during installation and the first time the SQL Server instance is started. It is used to encrypt the next key in the chain. The SMK should be backed up and stored in a secure, off-site location. This is an important step, because this is the first key in the hierarchy; any damage at this level can prevent access to all encrypted data in the layers below. When the SMK is restored, SQL Server decrypts all the keys and data that have been encrypted with the current SMK, and then encrypts them with the SMK from the backup.

The Service Master Key can be viewed with the following system catalog view:

1> SELECT name, create_date
2> FROM sys.symmetric_keys
3> GO

name                      create_date
------------------------- -----------------------
##MS_ServiceMasterKey##   2017-04-17 17:56:20.793

(1 row(s) affected)

Here is an example of how you can back up your SMK to the /var/opt/mssql/backup folder.

Note: If you don't have the /var/opt/mssql/backup folder, execute all five bash lines below. If the folder exists but you don't have permissions on it, execute all lines except the first one.
# sudo mkdir /var/opt/mssql/backup
# sudo chown mssql /var/opt/mssql/backup/
# sudo chgrp mssql /var/opt/mssql/backup/
# sudo /opt/mssql/bin/mssql-conf set filelocation.defaultbackupdir /var/opt/mssql/backup/
# sudo systemctl restart mssql-server

1> USE master
2> GO
Changed database context to 'master'.
1> BACKUP SERVICE MASTER KEY TO FILE = '/var/opt/mssql/backup/smk'
2> ENCRYPTION BY PASSWORD = 'S0m3C00lp4sw00rd'
3> --In the real scenarios your password should be more complicated
4> GO
exit

The next example shows how to restore the SMK from the backup location:

1> USE master
2> GO
Changed database context to 'master'.
1> RESTORE SERVICE MASTER KEY
2> FROM FILE = '/var/opt/mssql/backup/smk'
3> DECRYPTION BY PASSWORD = 'S0m3C00lp4sw00rd'
4> GO

You can examine the contents of your SMK with the ls command or some internal Linux file viewers, such as the one in Midnight Commander (MC). Basically there is not much to see, but that is the power of encryption. The SMK is the foundation of the SQL Server encryption hierarchy. You should keep a copy at an off-site location.

Database master key

The DMK is a symmetric key used to protect the private keys of certificates and asymmetric keys that are present in the database. When it is created, the master key is encrypted by using the AES 256 algorithm and a user-supplied password. To enable the automatic decryption of the master key, a copy of the key is encrypted by using the SMK and stored both in the user database and in the master database. The copy stored in master is always updated whenever the master key is changed.

The next T-SQL code shows how to create a DMK in the Sandbox database:

1> CREATE DATABASE Sandbox
2> GO
1> USE Sandbox
2> GO
3> CREATE MASTER KEY
4> ENCRYPTION BY PASSWORD = 'S0m3C00lp4sw00rd'
5> GO

Let's check where the DMK is with the sys.symmetric_keys system catalog view:

1> SELECT name, algorithm_desc
2> FROM sys.symmetric_keys
3> GO

name                       algorithm_desc
-------------------------- ---------------
##MS_DatabaseMasterKey##   AES_256

(1 row(s) affected)

This default can be changed by using the DROP ENCRYPTION BY SERVICE MASTER KEY option of ALTER MASTER KEY. A master key that is not encrypted by the SMK must be opened by using the OPEN MASTER KEY statement and a password.

Now that we know why the DMK is important and how to create one, we will continue with the following DMK operations:

ALTER
OPEN
CLOSE
BACKUP
RESTORE
DROP

These operations are important because all other encryption keys at the database level depend on the DMK. We can easily create a new DMK for Sandbox and re-encrypt the keys below it in the encryption hierarchy, assuming that we have the DMK created in the previous steps:

1> ALTER MASTER KEY REGENERATE
2> WITH ENCRYPTION BY PASSWORD = 'S0m3C00lp4sw00rdforN3wK3y'
3> GO

Opening the DMK for use:

1> OPEN MASTER KEY
2> DECRYPTION BY PASSWORD = 'S0m3C00lp4sw00rdforN3wK3y'
3> GO

Note: If the DMK was encrypted with the SMK, it will be automatically opened when it is needed for decryption or encryption. In this case, it is not necessary to use the OPEN MASTER KEY statement.
Closing the DMK after use:

1> CLOSE MASTER KEY
2> GO

Backing up the DMK:

1> USE Sandbox
2> GO
1> OPEN MASTER KEY
2> DECRYPTION BY PASSWORD = 'S0m3C00lp4sw00rdforN3wK3y';
3> BACKUP MASTER KEY TO FILE = '/var/opt/mssql/backup/Sandbox-dmk'
4> ENCRYPTION BY PASSWORD = 'fk58smk@sw0h%as2'
5> GO

Restoring the DMK:

1> USE Sandbox
2> GO
1> RESTORE MASTER KEY
2> FROM FILE = '/var/opt/mssql/backup/Sandbox-dmk'
3> DECRYPTION BY PASSWORD = 'fk58smk@sw0h%as2'
4> ENCRYPTION BY PASSWORD = 'S0m3C00lp4sw00rdforN3wK3y';
5> GO

When the master key is restored, SQL Server decrypts all the keys that are encrypted with the currently active master key, and then encrypts these keys with the restored master key.

Dropping the DMK:

1> USE Sandbox
2> GO
1> DROP MASTER KEY
2> GO

You read an excerpt from the book SQL Server on Linux, written by Jasmin Azemović. From this book, you will learn to configure and administer database solutions on Linux.

How SQL Server handles data under the hood
SQL Server basics
Creating reports using SQL Server 2016 Reporting Services

Hosting the service in IIS using the TCP protocol

Packt
30 Oct 2014
8 min read
In this article by Mike Liu, the author of WCF Multi-layer Services Development with Entity Framework, Fourth Edition, we will learn how to create and host a service in IIS using the TCP protocol.

(For more resources related to this topic, see here.)

Hosting WCF services in IIS using the HTTP protocol gives the best interoperability to the service, because the HTTP protocol is supported everywhere today. However, sometimes interoperability might not be an issue. For example, the service may be invoked only within your network, with Microsoft-only clients. In this case, hosting the service using the TCP protocol might be a better solution.

Benefits of hosting a WCF service using the TCP protocol

Compared to HTTP, there are a few benefits to hosting a WCF service using the TCP protocol:

It supports connection-based, stream-oriented delivery services with end-to-end error detection and correction
It is the fastest WCF binding for scenarios that involve communication between different machines
It supports duplex communication, so it can be used to implement duplex contracts
It has a reliable data delivery capability (this is applied between two TCP/IP nodes and is not the same thing as WS-ReliableMessaging, which applies between endpoints)

Preparing the folders and files

First, we need to prepare the folders and files for the host application, just as we did for hosting the service using the HTTP protocol. We will use the previous HTTP hosting application as the base to create the new TCP hosting application:

Create the folders: In Windows Explorer, create a new folder called HostIISTcp under C:\SOAwithWCFandEF\Projects\HelloWorld and a new subfolder called bin under the HostIISTcp folder. You should now have the following new folders: C:\SOAwithWCFandEF\Projects\HelloWorld\HostIISTcp and a bin folder inside the HostIISTcp folder.

Copy the files: Now, copy all the files from the HostIIS hosting application folder at C:\SOAwithWCFandEF\Projects\HelloWorld\HostIIS to the new folder that we created at C:\SOAwithWCFandEF\Projects\HelloWorld\HostIISTcp.

Create the Visual Studio solution folder: To make it easier to view and manage from the Visual Studio Solution Explorer, you can add a new solution folder, HostIISTcp, to the solution and add the Web.config file to this folder. Add another new solution folder, bin, under HostIISTcp and add the HelloWorldService.dll and HelloWorldService.pdb files under this bin folder. Add the following post-build events to the HelloWorldService project, so that all the files will be copied automatically the next time the service project is built:

xcopy "$(AssemblyName).dll" "C:\SOAwithWCFandEF\Projects\HelloWorld\HostIISTcp\bin" /Y
xcopy "$(AssemblyName).pdb" "C:\SOAwithWCFandEF\Projects\HelloWorld\HostIISTcp\bin" /Y

Modify the Web.config file: The Web.config file that we copied from HostIIS uses the default basicHttpBinding as the service binding. To make our service use the TCP binding, we need to change the binding to TCP and add a TCP base address. Open the Web.config file and add the following node to it under the <system.serviceModel> node:

<services>
  <service name="HelloWorldService.HelloWorldService">
    <endpoint address="" binding="netTcpBinding"
      contract="HelloWorldService.IHelloWorldService"/>
    <host>
      <baseAddresses>
        <add baseAddress="net.tcp://localhost/HelloWorldServiceTcp/"/>
      </baseAddresses>
    </host>
  </service>
</services>

In this new services node, we have defined one service called HelloWorldService.HelloWorldService.
The base address of this service is net.tcp://localhost/HelloWorldServiceTcp/. Remember, we have defined the host activation relative address as ./HelloWorldService.svc, so we can invoke this service from the client application with the following URL: http://localhost/HelloWorldServiceTcp/HelloWorldService.svc.

For file-less WCF activation, if no endpoint is defined explicitly, HTTP and HTTPS endpoints will be defined by default. In this example, we would like to expose only one TCP endpoint, so we have added an endpoint explicitly (as soon as this endpoint is added explicitly, the default endpoints will not be added). If you don't add this TCP endpoint explicitly here, the TCP client that we will create in the next section will still work, but in the client config file you will see three endpoints instead of one and you will have to specify which endpoint you are using in the client program.

The following is the full content of the Web.config file:

<?xml version="1.0"?>
<!--
  For more information on how to configure your ASP.NET application, please visit
  http://go.microsoft.com/fwlink/?LinkId=169433
-->
<configuration>
  <system.web>
    <compilation debug="true" targetFramework="4.5"/>
    <httpRuntime targetFramework="4.5" />
  </system.web>

  <system.serviceModel>
    <serviceHostingEnvironment>
      <serviceActivations>
        <add factory="System.ServiceModel.Activation.ServiceHostFactory"
          relativeAddress="./HelloWorldService.svc"
          service="HelloWorldService.HelloWorldService"/>
      </serviceActivations>
    </serviceHostingEnvironment>

    <behaviors>
      <serviceBehaviors>
        <behavior>
          <serviceMetadata httpGetEnabled="true"/>
        </behavior>
      </serviceBehaviors>
    </behaviors>

    <services>
      <service name="HelloWorldService.HelloWorldService">
        <endpoint address="" binding="netTcpBinding"
          contract="HelloWorldService.IHelloWorldService"/>
        <host>
          <baseAddresses>
            <add baseAddress="net.tcp://localhost/HelloWorldServiceTcp/"/>
          </baseAddresses>
        </host>
      </service>
    </services>
  </system.serviceModel>
</configuration>

Enabling the TCP WCF activation for the host machine

By default, the TCP WCF activation service is not enabled on your machine. This means your IIS server won't be able to host a WCF service with the TCP protocol. You can follow these steps to enable TCP activation for WCF services:

Go to Control Panel | Programs | Turn Windows features on or off.
Expand the Microsoft .NET Framework 3.5.1 node on Windows 7 or .NET Framework 4.5 Advanced Services on Windows 8.
Check the checkbox for Windows Communication Foundation Non-HTTP Activation on Windows 7 or TCP Activation on Windows 8.

The following screenshot depicts the options required to enable WCF activation on Windows 7:

The following screenshot depicts the options required to enable TCP WCF activation on Windows 8:

Repair the .NET Framework: After you have turned on TCP WCF activation, you have to repair .NET. Just go to Control Panel, click on Uninstall a Program, select Microsoft .NET Framework 4.5.1, and then click on Repair.

Creating the IIS application

Next, we need to create an IIS application named HelloWorldServiceTcp to host the WCF service using the TCP protocol. Follow these steps to create this application in IIS:

Open IIS Manager.
Add a new IIS application, HelloWorldServiceTcp, pointing to the HostIISTcp physical folder under your project's folder.
Choose DefaultAppPool as the application pool for the new application. Again, make sure your default app pool is a .NET 4.0.30319 application pool.
Enable the TCP protocol for the application. Right-click on HelloWorldServiceTcp, select Manage Application | Advanced Settings, and then add net.tcp to Enabled Protocols. Make sure you use all lowercase letters and separate it from the existing HTTP protocol with a comma.

Now the service is hosted in IIS using the TCP protocol. To view the WSDL of the service, browse to http://localhost/HelloWorldServiceTcp/HelloWorldService.svc and you should see the service description and a link to the WSDL of the service.

Testing the WCF service hosted in IIS using the TCP protocol

Now that we have the service hosted in IIS using the TCP protocol, let's create a new test client to test it:

Add a new console application project to the solution, named HelloWorldClientTcp.
Add a reference to System.ServiceModel in the new project.
Add a service reference to the WCF service in the new project, naming the reference HelloWorldServiceRef and using the URL http://localhost/HelloWorldServiceTcp/HelloWorldService.svc?wsdl. You can still use the SvcUtil.exe command-line tool to generate the proxy and config files for the service hosted with TCP, just as we did in previous sections. Actually, behind the scenes Visual Studio is also calling SvcUtil.exe to generate the proxy and config files.
Add the following code to the Main method of the new project:

var client = new HelloWorldServiceRef.HelloWorldServiceClient();
Console.WriteLine(client.GetMessage("Mike Liu"));

Finally, set the new project as the startup project.

Now, if you run the program, you will get the same result as before; however, this time the service is hosted in IIS using the TCP protocol.

Summary

In this article, we created and tested an IIS application to host the service with the TCP protocol.

Resources for Article:

Further resources on this subject:

Microsoft WCF Hosting and Configuration [Article]
Testing and Debugging Windows Workflow Foundation 4.0 (WF) Program [Article]
Applying LINQ to Entities to a WCF Service [Article]

François Chollet, creator of Keras on TensorFlow 2.0 and Keras integration, tricky design decisions in Deep Learning, and more

Sugandha Lahoti
10 Dec 2019
6 min read
TensorFlow 2.0 was made available in October. One of the major highlights of this release was the integration of Keras into TensorFlow. Keras is an open-source deep-learning library that is designed to enable fast, user-friendly experimentation with deep neural networks. It serves as an interface to several deep learning libraries, most popular of which is TensorFlow, and it was integrated into TensorFlow main codebase in TensorFlow 2.0. In September, Lex Fridman, Research scientist at MIT popularly known for his podcasts, spoke to François Chollet, who is the author of Keras on Keras, Deep Learning, and the Progress of AI. In this post, we have tried to highlight François’ views on the Keras and TensorFlow 2.0 integration, early days of Keras and the importance of design decisions for building deep learning models. We recommend the full podcast that’s available on Fridman’s YouTube channel. Want to build Neural Networks? [box type="shadow" align="" class="" width=""]If you want to build multiple neural network architectures such as CNN, RNN, LSTM in Keras, we recommend you to read Neural Networks with Keras Cookbook by V Kishore Ayyadevara. This book features over 70 recipes such as object detection and classification, building self-driving car applications, understanding data encoding for image, text and recommender systems and more. [/box] Early days of Keras and how it was integrated into TensorFlow I started working on Keras in 2015, says Chollet. At that time Caffe was the popular deep learning library, based on C++ and was popular for building Computer Vision projects. Chollet was interested in Recurrent Neural Networks (RNNs) which was a niche topic at that time. Back then, there was no good solution or reusable open-source implementation of RNNs and LSTMs, so he decided to build his own and that’s how Keras started. “It was going to be mostly around RNNs and LSTMs and the models would be defined by Python code, which was going against mainstream,” he adds. Later, he joined Google’s research team working on image classification. At that time, he was exposed to the early internal version of Tensorflow - which was an improved version of Theano. When Tensorflow was released in 2015, he refactored Keras to run on TensorFlow. Basically he was abstracting away all the backend functionality into one module so that the same codebase could run on top of multiple backends. A year later, the TensorFlow team requested him to integrate the Keras API into TensorFlow more tightly.  They build a temporary TensorFlow-only version of Keras that was in tf.contrib for a while. Then they finally moved to TensorFlow Core in 2017. TensorFlow 2.0 gives both usability and flexibility to Keras Keras has been a very easy-to-use high-level interface to do deep learning. However, it lacked in flexibility - Keras framework was not the optimal way to do things compared to just writing everything from scratch. TensorFlow 2.0 offers both usability and flexibility to Keras. You have the usability of the high-level interface but you have the flexibility of the lower-level interface. You have this spectrum of workflows where you can get more or less usability and flexibility,  the trade-offs depending on your needs. It's very flexible, easy to debug, and powerful but also integrates seamlessly with higher-level features up to classic Keras workflows. 
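To make that spectrum concrete, here is a minimal sketch of what the classic high-level Keras workflow looks like next to the lower-level one in TensorFlow 2.0. The layer sizes and the random stand-in data are illustrative assumptions, not code from the interview.

import numpy as np
import tensorflow as tf

# High-level end of the spectrum: define, compile, fit.
model = tf.keras.Sequential([
    tf.keras.layers.Dense(64, activation="relu", input_shape=(20,)),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])

x = np.random.rand(256, 20).astype("float32")                  # stand-in features
y = (x.sum(axis=1) > 10).astype("float32").reshape(-1, 1)      # stand-in labels
model.fit(x, y, epochs=3, batch_size=32, verbose=0)

# Lower-level end of the same spectrum: the same model object in a custom
# training step, using GradientTape for full control and easier debugging.
loss_fn = tf.keras.losses.BinaryCrossentropy()
optimizer = tf.keras.optimizers.Adam()
with tf.GradientTape() as tape:
    predictions = model(x[:32], training=True)
    loss = loss_fn(y[:32], predictions)
grads = tape.gradient(loss, model.trainable_variables)
optimizer.apply_gradients(zip(grads, model.trainable_variables))

The same tf.keras model moves between the two styles without being rewritten, which is the usability-plus-flexibility trade-off Chollet describes next.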
“You have the same framework offering the same set of APIs that enable a spectrum of workflows that are more or less high level and are suitable for you know profiles ranging from researchers to data scientists and everything in between,” says Chollet. Design decisions are especially important while integrating Keras with Tensorflow “Making design decisions is as important as writing code”, claims Chollet. A lot of thought and care is taken in coming up with these decisions, taking into account the diverse user base of TensorFlow - small-scale production users, large-scale production users, startups, and researchers. Chollet says, “A lot of the time I spend on Google is actually discussing design. This includes writing design Docs, participating in design review meetings, etc.” Making a design decision is about satisfying a set of constraints but also trying to do so in the simplest way possible because this is what can be maintained and expanded in the future. You want to design APIs that are modular and hierarchical so that they have an API surface that is as small as possible. You want this modular hierarchical architecture to reflect the way that domain experts think about the problem. On the future of Keras and TensorFlow. What’s going to happen in TensorFlow 3.0? Chollet says that he’s really excited about developing even higher-level APIs with Keras. He’s also excited about hyperparameter tuning by automated machine learning. He adds, “The future is not just, you know, defining a model, it's more like an automatic model.” Limits of deep learning wrt function approximators that try to generalize from data Chollet emphasizes that “Neural Networks don't generalize well, humans do.” Deep Learning models are like huge parametric and differentiable models that go from an input space to an output space, trained with gradient descent. They are learning a continuous geometric morphing from an input vector space to an output space. As this is done point by point; a deep neural network can only make sense of points in space that are very close to things that it has already seen in string data. At best it can do the interpolation across points. However, that means in order to train your network you need a dense sampling of the input, almost a point-by-point sampling which can be very expensive if you're dealing with complex real-world problems like autonomous driving or robotics.  In contrast to this, you can look at very simple rules algorithms. If you have a symbolic rule it can actually apply to a very large set of inputs because it is abstract, it is not obtained by doing a point by point mapping. Deep learning is really like point by point geometric morphings. Meanwhile, abstract rules can generalize much better. I think the future is which can combine the two. Chollet also talks about self-improving Artificial General Intelligence, concerns about short-term and long-term threats in AI, Program synthesis, Good test for intelligence and more. The full podcast is available on Lex’s YouTube channel. If you want to implement neural network architectures in Keras for varied real-world applications, you may go through our book Neural Networks with Keras Cookbook. TensorFlow.js contributor Kai Sasaki on how TensorFlow.js eases web-based machine learning application development 10 key announcements from Microsoft Ignite 2019 you should know about What does a data science team look like?

How social media enabled and amplified the Christchurch terrorist attack

Fatema Patrawala
19 Mar 2019
11 min read
The recent horrifying terrorist attack in New Zealand has cast new blame on how technology platforms police content. There are now questions about whether global internet services are designed to work this way? And if online viral hate is uncontainable? Fifty one people so far have been reported to be dead and 50 more injured after the terrorist attacks on two New Zealand mosques on Friday. The victims included children as young as 3 and 4 years old, and elderly men and women. The alleged shooter is identified as a 28 year old Australian man named Brenton Tarrant. Brenton announced the attack on the anonymous-troll message board 8chan. There, he posted images of the weapons days before the attack, and made an announcement an hour before the shooting. On 8chan, Facebook and Twitter, he also posted links to a 74-page manifesto, titled “The Great Replacement,” blaming immigration for the displacement of whites in Oceania and elsewhere. The manifesto cites “white genocide” as a motive for the attack, and calls for “a future for white children” as its goal. Further he live-streamed the attacks on Facebook, YouTube; and posted a link to the stream on 8chan. It’s terrifying and disgusting, especially when 8chan is one of the sites where disaffected internet misfits create memes and other messages to provoke dismay and sow chaos among people. “8chan became the new digital home for some of the most offensive people on the internet, people who really believe in white supremacy and the inferiority of women,” Ethan Chiel wrote. “It’s time to stop shitposting,” the alleged shooter’s 8chan post reads, “and time to make a real-life effort post.” Many of the responses, anonymous by 8chan’s nature, celebrate the attack, with some posting congratulatory Nazi memes. A few seem to decry it, just for logistical quibbles. And others lament that the whole affair might destroy the site, a concern that betrays its users’ priorities. Social media encourages performance crime The use of social media technology and livestreaming marks the attack as different from many other terrorist incidents. It is a form of violent “performance crime”. That is, the video streaming is a central component of the violence itself, it’s not somehow incidental to the crime, or a trophy for the perpetrator to re-watch later. In the past, terrorism functioned according to what has been called the “theatre of terror”, which required the media to report on the spectacle of violence created by the group. Nowadays with social media in our hands it's much easier for someone to both create the spectacle of horrific violence and distribute it widely by themselves. There is a tragic and recent history of performance crime videos that use live streaming and social media video services as part of their tactics. In 2017, for example, the sickening murder video of an elderly man in Ohio was uploaded to Facebook, and the torture of a man with disabilities in Chicago was live streamed. In 2015, the murder of two journalists was simultaneously broadcast on-air, and live streamed. Tech companies on the radar Social-media companies scrambled to take action as the news—and the video—of the attack spread. Facebook finally managed to pull down Tarrant’s profiles and the video, but only after New Zealand police brought the live-stream to the company’s attention. It has been working "around the clock" to remove videos of the incident shared on its platform. 
In a statement posted to Twitter on Sunday, the tech company said that within 24 hours of Friday’s shooting it had removed 1.5 million videos of the attack from its platform globally. YouTube said it had also removed an “unprecedented volume” of videos of the shooting. Twitter also suspended Tarrant’s account, where he had posted links to the manifesto from several file-sharing sites. The chaotic aftermath mostly took place while many North Americans slept unaware, waking up to the news and its associated confusion. By morning on the East Coast, news outlets had already weighed in on whether technology companies might be partly to blame for catastrophes such as the New Zealand massacre because they have failed to catch offensive content before it spreads. One of the tweets say Google, Twitter and Facebook made a choice to not use tools available to them to stop white supremacist terrorism. https://twitter.com/samswey/status/1107055372949286912 Countries like Germany and France already have a law in place that demands social media sites move quickly to remove hate speech, fake news and illegal material. Sites that do not remove "obviously illegal" posts could face fines of up to 50m euro (£44.3m). In the wake of the attack, a consortium of New Zealand’s major companies has pledged to pull their advertising from Facebook. In a joint statement, the Association of New Zealand Advertisers (ANZA) and the Commercial Communications Council asked domestic companies to think about where “their advertising dollars are spent, and carefully consider, with their agency partners, where their ads appear.” They added, “We challenge Facebook and other platform owners to immediately take steps to effectively moderate hate content before another tragedy can be streamed online.” Additionally internet service providers like Vodafone, Spark and Vocus in New Zealand are blocking access to websites that do not respond or refuse to comply to requests to remove reuploads of the shooter’s original live stream. The free speech vs safety debate puts social media platforms in the crosshairs Tech Companies are facing new questions on content moderation following the New Zealand attack. The shooter posted a link to the live stream, and soon after he was apprehended, reuploads were found on other platforms like YouTube and Twitter. “Tech companies basically don’t see this as a priority,” the counter-extremism policy adviser Lucinda Creighton commented. “They say this is terrible, but what they’re not doing is preventing this from reappearing.” Others affirmed the importance of quelling the spread of the manifesto, video, and related materials, for fear of producing copycats, or of at least furthering radicalization among those who would be receptive to the message. The circulation of ideas might have motivated the shooter as much as, or even more than, ethnic violence. As Charlie Warzel wrote at The New York Times, the New Zealand massacre seems to have been made to go viral. Tarrant teased his intentions and preparations on 8chan. When the time came to carry out the act, he provided a trove of resources for his anonymous members, scattered to the winds of mirror sites and repositories. Once the live-stream started, one of the 8chan user posted “capped for posterity” on Tarrant’s thread, meaning that he had downloaded the stream’s video for archival and, presumably, future upload to other services, such as Reddit or 4chan, where other like-minded trolls or radicals would ensure the images spread even further. 
As Warzel put it, “Platforms like Facebook, Twitter, and YouTube … were no match for the speed of their users.” The internet is a Pandora’s box that never had a lid. Camouflaging stories is easy but companies trying hard in building AI to catch it Last year, Mark Zuckerberg defended himself and Facebook before Congress against myriad failures, which included Russian operatives disrupting American elections and permitting illegal housing ads that discriminate by race. Mark Zuckerberg repeatedly invoked artificial intelligence as a solution for the problems his and other global internet companies have created. There’s just too much content for human moderators to process, even when pressed hard to do so under poor working conditions. The answer, Zuckerberg has argued, is to train AI to do the work for them. But that technique has proved insufficient. That’s because detecting and scrubbing undesirable content automatically is extremely difficult. False positives enrage earnest users or foment conspiracy theories among paranoid ones, thanks to the black-box nature of computer systems. Worse, given a pool of billions of users, the clever ones will always find ways to trick any computer system, for example, by slightly modifying images or videos in order to make them appear different to the computer but identical to human eyes. 8chan, as it happens, is largely populated by computer-savvy people who have self-organized to perpetrate exactly those kinds of tricks. The primary sources of content are only part of the problem. Long after the deed, YouTube users have bolstered conspiracy theories about murders, successfully replacing truth with lies among broad populations of users who might not even know they are being deceived. Even stock-photo providers are licensing stills from the New Zealand shooter’s video; a Reuters image that shows the perpetrator wielding his rifle as he enters the mosque is simply credited, “Social media.” Interpreting real motives is difficult on social The video is just the tip of the iceberg. Many smaller and less obviously inflamed messages have no hope of being found, isolated, and removed by technology services. The shooter praised Donald Trump as a “symbol of renewed white identity” and incited the conservative commentator Candace Owens, who took the bait on Twitter in a post that got retweeted thousands of times by the morning after the attack. The shooter’s forum posts and video are littered with memes and inside references that bear special meaning within certain communities on 8chan, 4chan, Reddit, and other corners of the internet, offering tempting receptors for consumption and further spread. Perhaps worst of all, the forum posts, the manifesto, and even the shooting itself might not have been carried out with the purpose that a literal read of their contents suggests. At the first glance, it seems impossible to deny that this terrorist act was motivated by white-extremist hatred, an animosity that authorities like the FBI expert and the Facebook officials would want to snuff out before it spreads. But 8chan is notorious for users with an ironic and rude behaviour under the shades of anonymity.They use humor, memes and urban slang to promote chaos and divisive rhetoric. As the internet separates images from context and action from intention, and then spreads those messages quickly among billions of people scattered all around the globe. That structure makes it impossible to even know what individuals like Tarrant “really mean” by their words and actions. 
As it spreads, social-media content neuters earnest purpose entirely, putting it on the same level as anarchic randomness. What a message means collapses into how it gets used and interpreted. For 8chan trolls, any ideology might be as good as any other, so long as it produces chaos. We all have a role to play It’s easy to say that technology companies can do better. They can, and they should. But ultimately, content moderation is not the solution by itself. The problem is the media ecosystem they have created. The only surprise is that anyone would still be surprised that social media produce this tragic abyss, for this is what social media are supposed to do, what they were designed to do: spread the images and messages that accelerate interest and invoke raw emotions, without check, and absent concern for their consequences. We hope that social media companies get better at filtering out violent content and explore alternative business models, and governments think critically about cyber laws that protect both people and speech. But until they do we should reflect on our own behavior too. As news outlets, we shape the narrative through our informed perspectives which makes it imperative to publish legitimate & authentic content. Let’s as users too make a choice of liking and sharing content on social platforms. Let’s consider how our activities could contribute to an overall spectacle society that might inspire future perpetrator-produced videos of such gruesome crime – and act accordingly. In this era of social spectacle, we all have a role to play in ensuring that terrorists aren’t rewarded for their crimes with our clicks and shares. The Indian government proposes to censor social media content and monitor WhatsApp messages Virality of fake news on social media: Are weaponized AI bots to blame, questions Destin Sandlin Mastodon 2.7, a decentralized alternative to social media silos, is now out!

NeurIPS 2018: Rethinking transparency and accountability in machine learning

Bhagyashree R
16 Dec 2018
8 min read
Key takeaways from the discussion To solve problems with machine learning, you must first understand them. Different people or groups of people are going to define a problem in a different way. So, we shouldn't believe that the way we want to frame the problem computationally is the right way. If we allow that our systems include people and society, it is clear that we have to help negotiate values, not simply define them. Last week, at the 32nd NeurIPS 2018 annual conference, Nitin Koli, Joshua Kroll, and Deirdre Mulligan presented the common pitfalls we see when studying the human side of machine learning. Machine learning is being used in high-impact areas like medicine, criminal justice, employment, and education for making decisions. In recent years, we have seen that this use of machine learning and algorithmic decision making have resulted in unintended discrimination.  It’s becoming clear that even models developed with the best of intentions may exhibit discriminatory biases and perpetuate inequality. Although researchers have been analyzing how to put concepts like fairness, accountability, transparency, explanation, and interpretability into practice in machine learning, properly defining these things can prove a challenge. Attempts have been made to define them mathematically, but this can bring new problems. This is because applying mathematical logic to human concepts that have unique and contested political and social dimensions necessarily has blind spots - every point of contestation can’t be integrated into a single formula. In turn, this can cause friction with other disciplines as well as the public. Based on their research on what various terms mean in different contexts, Nitin Koli, Joshua Krill, and Deirdre Mulligan drew out some of the most common misconceptions machine learning researchers and practitioners hold. Sociotechnical problems To find a solution to a particular problem, data scientists need precise definitions. But how can we verify that these definitions are correct? Indeed, many definitions will be contested, depending on who you are and what you want them to mean. A definition that is fair to you will not necessarily be fair to me”, remarks Mr. Kroll. Mr. Kroll explained that while definitions can be unhelpful, they are nevertheless essential from a mathematical perspective.  This means there appears to be an unresolved conflict between concepts and mathematical rigor. But there might be a way forward. Perhaps it’s wrong to simply think in this dichotomy of logical rigor v. the messy reality of human concepts. One of the ways out of this impasse is to get beyond this dichotomy. Although it’s tempting to think of the technical and mathematical dimension on one side, with the social and political aspect on the other, we should instead see them as intricately related. They are, Kroll suggests, socio-technical problems. Kroll goes on to say that we cannot ignore the social consequences of machine learning: “Technologies don’t live in a vacuum and if we pretend that they do we kind of have put our blinders on and decided to ignore any human problems.” Fairness in machine learning In the real world, fairness is a concept directly linked to processes. Think, for example, of the voting system. Citizens cast votes to their preferred candidates and the candidate who receives the most support is elected. Here, we can say that even though the winning candidate was not the one a candidate voted for, but at least he/she got the chance to participate in the process. 
This type of fairness is called procedural fairness. However, in the technical world, fairness is often viewed in a subtly different way. When you place it in a mathematical context, fairness centers on outcome rather than process. Kohli highlighted that trade offs between these different concepts can’t be avoided. They’re inevitable. A mathematical definition of fairness places a constraint over the behavior of a system, and this constraint will narrow down the cause of models that can satisfy these conditions. So, if we decide to add too many fairness constraints to the system, some of them will be self-contradictory. One more important point machine learning practitioners should keep in mind is that when we talk about the fairness of a system, that system isn’t a self-contained and coherent thing. It is not a logical construct - it’s a social one. This means there are a whole host of values, ideas, and histories that have an impact on its reality.. In practice, this ultimately means that the complexity of the real world from which we draw and analyze data can have an impact on how a model works. Kohli explained this by saying, “it doesn’t really matter... whether you are building a fair system if the context in which it is developed and deployed in is fundamentally unfair.” Accountability in machine learning Accountability is ultimately about trust. It’s about the extent you can be sure you know what is ‘true’ about a system. It refers to the fact that you know how it works and why it does things in certain ways. In more practical terms, it’s all about invariance and reliability. To ensure accountability inside machine learning models, we need to follow a layered model. The bottom layer is an accounting or recording layer, that keeps track of what a given system is doing and the ways in which it might have been changed.. The next layer is a more analytical layer. This is where those records on the bottom layer are analyzed, with decisions made about performance - whether anything needs to be changed and how they should be changed. The final and top-most layer is about responsibility. It’s where the proverbial buck stops - with those outside of the algorithm, those involved in its construction. “Algorithms are not responsible, somebody is responsible for the algorithm,”  explains Kroll. Transparency Transparency is a concept heavily tied up with accountability. Arguably you have no accountability without transparency. The layered approach discussed above should help with transparency, but it’s also important to remember that transparency is about much more than simply making data and code available. Instead, it demands that the decisions made in the development of the system are made available and clear too. Mr. Kroll emphasizes, “to the person at the ground-level for whom the decisions are being taken by some sort of model, these technical disclosures aren’t really useful or understandable.” Explainability In his paper Explanation in Artificial Intelligence: Insights from the Social Sciences, Tim Miller describes what is explainable artificial intelligence. According to Miller, explanation takes many forms such as causal, contrastive, selective, and social. Causal explanation gives reasons behind why something happened, for example, while contrastive explanations can provide answers to questions like“Why P rather than not-P?". But the most important point here is that explanations are selective. 
An explanation cannot include all reasons why something happened; explanations are always context-specific, a response to a particular need or situation. Think of it this way: if someone asks you why the toaster isn’t working, you could just say that it’s broken. That might be satisfactory in some situations, but you could, of course, offer a more substantial explanation, outlining what was technically wrong with the toaster, how that technical fault came to be there, how the manufacturing process allowed that to happen, how the business would allow that manufacturing process to make that mistake… you could, of course, go on and on. Data is not the truth Today, there is a huge range of datasets available that will help you develop different machine learning models. These models can be useful, but it’s essential to remember that they are models. A model isn’t the truth - it’s an abstraction, a representation of the world in a very specific way. One way of taking this fact into account is the concept of ‘construct validity’. This sounds complicated, but all it really refers to is the extent to which a test - say a machine learning algorithm - actually measures what it says it’s trying to measure. The concept is widely used in disciplines like psychology, but in machine learning, it simply refers to the way we validate a model based on its historical predictive accuracy. In a nutshell, it’s important to remember that just as data is an abstraction of the world, models are also an abstraction of the data. There’s no way of changing this, but having an awareness that we’re dealing in abstractions ensures that we do not lapse into the mistake of thinking we are in the realm of ‘truth’. To build a fair(er) systems will ultimately require an interdisciplinary approach, involving domain experts working in a variety of fields. If machine learning and artificial intelligence is to make a valuable and positive impact in fields such as justice, education, and medicine, it’s vital that those working in those fields work closely with those with expertise in algorithms. This won’t fix everything, but it will be a more robust foundation from which we can begin to move forward. You can watch the full talk on the Facebook page of NeurIPS. Researchers unveil a new algorithm that allows analyzing high-dimensional data sets more effectively, at NeurIPS conference Accountability and algorithmic bias: Why diversity and inclusion matters [NeurIPS Invited Talk] NeurIPS 2018: A quick look at data visualization for Machine learning by Google PAIR researchers [Tutorial]