Data | Tech News, Tutorials & Expert Insights

25 Sep 2018

5 min read

What is a convolutional neural network (CNN)? [Video]

Aarthi Kumaraswamy

24 Sep 2018

19 min read

How far will Facebook go to fix what it broke: Democracy, Trust, Reality

Facebook, along with other tech media giants, like Twitter and Google, broke the democratic process in 2016. Facebook also broke the trust of many of its users as scandal after scandal kept surfacing telling the same story in different ways - the story of user data and trust abused in exchange for growth and revenue. The week before last, Mark Zuckerberg posted a long explanation on Facebook titled ‘Preparing for Elections’. It is the first of a series of reflections by Zuckerberg that ‘address the most important issues facing Facebook’. That post explored what Facebook is doing to avoid ending up in a situation similar to the 2016 elections when the platform ‘inadvertently’ became a super-effective channel for election interference of various kinds. It follows just weeks after Facebook COO, Sheryl Sandberg appeared in front of a Senate Intelligence hearing alongside Twitter CEO, Jack Dorsey on the topic of social media’s role in election interference. Zuckerberg’s mobile-first rigor oversimplifies the issues Zuckerberg opened his post with a strong commitment to addressing the issues plaguing Facebook using the highest levels of rigor the company has known in its history. He wrote, “I am bringing the same focus and rigor to addressing these issues that I've brought to previous product challenges like shifting our services to mobile.” To understand the weight of this statement we must go back to how Facebook became a mobile-first company that beat investor expectations wildly. Suffice to say it went through painful years of restructuring and reorientation in the process. Those unfamiliar with that phase of Facebook, please read the section ‘How far did Facebook go to become a mobile-first company?’ at the end of this post for more details. To be fair, Zuckerberg does acknowledge that pivoting to mobile was a lot easier than what it will take to tackle the current set of challenges. 
He writes, “These issues are even harder because people don't agree on what a good outcome looks like, or what tradeoffs are acceptable to make. When it comes to free expression, thoughtful people come to different conclusions about the right balances. When it comes to implementing a solution, certainly some investors disagree with my approach to invest so much on security. We have a lot of work ahead, but I am confident we will end this year with much more sophisticated approaches than we began, and that the focus and investments we've put in will be better for our community and the world over the long term.” However, what Zuckerberg does not acknowledge in the above statement is that the current set of issues is not merely a product challenge, but a business ethics and sustainability challenge. Unless ‘an honest look in the mirror’ kind of analysis is done on that side of Facebook, any level of product improvements will only result in cosmetic changes that will end in an ‘operation successful, patient dead’ scenario. In the coming sections, I attempt to dissect Zuckerberg’s post in the context of the above points by reading between the lines to see how serious the platform really is about changing its ways to ‘be better for our community and the world over the long term’. Why does Facebook’s commitment to change feel hollow? Let’s focus on election interference in this analysis as Zuckerberg limits his views to this topic in his post. Facebook has been at the center of this story on many levels. Here is some context on where Zuckerberg is coming from. Facebook’s involvement in the 2016 election meddling Apart from the traditional cyber-attacks (which they had even back then managed to prevent successfully), there were Russia-backed coordinated misinformation campaigns found on the platform. Then there was also the misuse of its user data by data analytics firm, Cambridge Analytica, which consulted on election campaigning. 
They micro-profiled users based on their psychographics (the way they think and behave) to ensure more effective ad spending by political parties. There was also the issue of certain kinds of ads, subliminal messages and peer pressure sent out to specific Facebook users during elections to prompt them to vote for certain candidates while others did not receive similar messages. There were also alleged reports of a certain set of users having been sent ‘dark posts’ (posts that aren’t publicly visible to all, but visible only to those on the target list) to discourage them from voting altogether. It also appears that Facebook staff offered both the Clinton and the Trump campaigns to assist with Facebook advertising. The former declined the offer while the latter accepted. We don’t know which of the above and to what extent each of these decisions and actions impacted the outcome of the 2016 US presidential elections. But one thing is certain, collectively they did have a significant enough impact for Zuckerberg and team to acknowledge these are serious problems that they need to address, NOW! Deconstructing Zuckerberg’s ‘Protecting Elections’ Before diving into what is problematic about the measures that are taken (or not taken) by Facebook, I must commend them for taking ownership of their role in election interference in the past and for attempting to rectify the wrongs. I like that Zuckerberg has made himself vulnerable by sharing his corrective plans with the public while it is a work in progress and is engaging with the public at a personal level. Facebook’s openness to academic research using anonymized Facebook data and their willingness to permit publishing findings without Facebook’s approval is also noteworthy. Other initiatives such as the political ad transparency report, AI enabled fake account & fake news reduction strategy, doubling the content moderator base, improving their recommendation algorithms are all steps in the right direction. 
However, this is where my list of nice things to say ends. The overall tone of Zuckerberg’s post is that of bargaining rather than that of acceptance. Interestingly this was exactly the tone adopted by Sandberg as well in the Senate hearing earlier this month, down to some very similar phrases. This makes one question if everything isn’t just one well-orchestrated PR disaster management plan. Disappointingly, most of the actions stated in Zuckerberg's post feel like half-measures; I get the sense that they aren’t willing to go the full distance to achieve the objectives they set for themselves. I hope to be wrong. 1. Zuckerberg focuses too much on ‘what’ and ‘how’, is ignoring the ‘why’ Zuckerberg identifies three key issues he wants to address in 2018: preventing election interference, protecting the community from abuse, and providing users with better control over their information. This clarity is a good starting point. In this post, he only focuses on the first issue. So I will reserve sharing my detailed thoughts on the other two for now. What I would say for now is that the key to addressing all issues on Facebook is taking a hard look at Facebook policies, including privacy, from a mission statement perspective. In other words, be honest about ‘Why Facebook exists’. Users are annoyed, advertisers are not satisfied and neither are shareholders confident about Facebook’s future. Trying to be everyone’s friend is clearly not working for Facebook. As such, I expected this in the opening part of the series. ‘Be better for our community and the world over the long term’ is too vague of a mission statement to be of any practical use. 2. Political Ad transparency report is necessary, but not sufficient In May this year, Facebook released its first political ad transparency report as a gesture to show its commitment to minimizing political interference. The report allows one to see who sponsored which issue advertisement and for how much. 
This was a move unanimously welcomed by everyone and soon others like Twitter and Google followed suit. By doing this, Facebook hopes to allow its users to form more informed views about political causes and other issues. Here is my problem with this feature. (Yes, I do view this report as a ‘feature’ of the new Facebook app which serves a very specific need: to satisfy regulators and media.) The average Facebook user is not the politically or technologically savvy consumer. They use Facebook to connect with friends and family and maybe play silly games now and then. The majority of these users aren’t going to proactively check out this ad transparency report or the political ad database to arrive at the right conclusions. The people who will find this report interesting are academic researchers, campaign managers, and analysts. It is one more rich data point to understand campaign strategy and thereby infer who the target audience is. This could most likely lead to a downward spiral of more and more polarizing ads from parties across the spectrum. 3. How election campaigning, hate speech, and real violence are linked but unacknowledged Another issue closely tied with political ads is hate speech and violence-inciting polarising content that aren’t necessarily paid ads. These are typical content in the form of posts, images or videos that are posted as a response to political ads or discourses. These act as carriers that amplify the political message, often in ways unintended by the campaigners themselves. The echo chambers still exist. And the more one's ecosystem or ‘look-alike audience’ responds to certain types of ads or posts, users are more likely to keep seeing them, thanks to Facebook's algorithms. Seeing something that is endorsed by one’s friends often primes one to trust what is said without verifying the facts for themselves thus enabling fake news to go viral. The algorithm does the rest to ensure everyone who will engage with the content sees it. 
Newsy political ads will thrive in such a setup while getting away with saying ‘we made full disclosure in our report’. All of this is great for Facebook’s platform as it not only gets great engagement from the content but also increased ad spending from all political parties as they can’t afford to be missing from action on Facebook. A by-product of this ultra-polarised scenario though is more protectionism and less free, open and meaningful dialog and debate between candidates as well as supporters on the platform. That’s bad news for the democratic process.

4. Facebook’s election interference prevention model is not scalable

Their single-minded focus on eliminating US election interference on Facebook’s platforms through a multipronged approach to content moderation is worth appreciating. This also makes one optimistic about Facebook’s role in consciously attempting to do the right thing when it comes to respecting election processes in other nations as well. But the current approach of creating an ‘election war room’ is neither scalable nor sustainable. What happens every time a constituency in the US has some election or some part of the world does? What happens when multiple elections take place across the world simultaneously? Who does Facebook prioritize to provide election interference defense support and why? Also, I wouldn’t go too far to trust that they will uphold individual liberties in troubled nations with strong regimes or strong divisive political discourses. What happens when the ruling party is the one interfering with the elections? Who is Facebook answerable to?

5. Facebook’s headcount hasn’t kept up with its own growth ambitions

Zuckerberg proudly states in his post that they’ve deleted a billion fake accounts with machine learning and have doubled the number of people hired to work on safety and security. "With advances in machine learning, we have now built systems that block millions of fake accounts every day.
In total, we removed more than one billion fake accounts -- the vast majority within minutes of being created and before they could do any harm -- in the six months between October and March. ....it is still very difficult to identify the most sophisticated actors who build their networks manually one fake account at a time. This is why we've also hired a lot more people to work on safety and security -- up from 10,000 last year to more than 20,000 people this year."

‘People working on safety and security’ could have a wide range of job responsibilities from network security engineers to security guards hired at Facebook offices. What is conspicuously missing from the above picture is a breakdown of the number of people hired specifically to fact check, moderate content, resolve policy related disputes and review flagged content. With billions of users posting on Facebook, the job of content moderators and policy enforcers, even when assisted by algorithms, is massive. It is important that they are rightly incentivized to do their job well and are set clear and measurable goals. The post talks neither of how Facebook plans to reward moderators nor of what the yardsticks for performance in this area would be. Facebook fails to acknowledge that it is not fully prepared, partly because it is understaffed.

6. The new Product Policy Director, human rights role is a glorified Public Relations job

The weekend following Zuckerberg’s post, a new job opening appeared on Facebook’s careers page for the position of ‘Product policy director, human rights’. The snippet below is taken from that job posting.

Source: Facebook careers

The above is typically what a Public Relations head does as well. Not only are the responsibilities cited above heavily based on communication and public perception building, there’s not much given in terms of authority to this role to influence how other teams achieve their goals.
Simply put, this role ‘works with, coordinates or advises teams’, it does not ‘guide or direct teams’. Also, another key point to observe is that this role aims to add another layer of distance to further minimize exposure for Zuckerberg, Sandberg and other key executives in public forums such as congressional hearings or press meets. Any role/area that is important to a business typically finds a place at the C-suite table. Had this new role been one of the C-suite roles it would have been advertised so, and it may have had some teeth. Of the 24 key executives in Facebook, only one is concerned with privacy and policy, ‘Chief Privacy Officer & VP of U.S. Public Policy’. Even this role does not have a global directive or public welfare in mind. On the other hand, there are multiple product development, creative and business development roles on Facebook’s C-suite. There is even a separate watch product head, a messaging product head, and one just dedicated to China called ‘Head of Creative Shop - Greater China’.

This is why Facebook’s plan to protect elections will fail

I am afraid Facebook’s greatest strength is also its Achilles heel. The tech industry’s deified hacker culture is embodied perfectly by Facebook. Facebook’s ad revenue based flawed business model is the ingenious creation of that very hacker culture. Any attempt to correct everything else is futile without correcting the issues with the current model. The ad revenue based model is why the Facebook app is designed the way it is: with ‘relevant’ news feeds, filter bubbles and look-alike audience segmentation. It is the reason why viral content gets rewarded irrespective of its authenticity or the impact it has on society. It is also the reason why Facebook has a ‘move fast and break things’ internal culture where growth at all costs is favored and idolized. Facebook’s Q2 2018 Earnings summary highlights the above points succinctly.
Source: Facebook's SEC Filing The above snapshot means that even if we assume all 30k odd employees do some form of content moderation (the probability of which is zero), every employee is responsible for 50k users’ content daily. Let’s say every user only posts 1 post a day. If we assume Facebook’s news feed algorithms are super efficient and only find 2% of the user content questionable/fake (as speculated by Sandberg in her Senate hearing this month), that would still mean nearly 1k posts per person to review every day! What can Facebook do to turn over a new leaf? Unless Facebook attempts to sincerely address at least some of the below, I will continue to be skeptical of any number of beautifully written posts by Zuckerberg or patriotically orated speeches by Sandberg. A content moderation transparency report that shares not just the number of posts moderated, the number of people working to moderate content on Facebook but also the nature of content moderated, the moderators’ job satisfaction levels, their tenure, qualifications, career aspirations, their challenges, and how much Facebook is investing in people, processes and technology to make its platform safe and objective for everyone to engage with others. A general Ad transparency report that not only lists advertisers on Facebook but also their spendings and chosen ad filters for the public and academia to review or analyze any time. Taking responsibility for the real-world consequences of actions enabled by Facebook. Like the recent gender and age discrimination employment ads shown on Facebook. Really banning hate speech and fake viral content. Bring in a business/AI ethics head who is only next to Zuckerberg and equal to Sandberg’s COO role. Exploring and experimenting with other alternative revenue channels to tackle the current ad-driven business model problem. 
Resolving the UI problem so that users can gain back control over their data and make it easy to choose to not participate in Facebook’s data experiments. This would mean a potential loss in some ad revenue. The ‘grow hacker’ culture problem that is a byproduct of years of moving fast and breaking things. This would mean a significant change in behavior by everyone starting from the top and probably restructuring the way teams are organized and business is done. It would also mean a different definition and measurement of success which could lead to shareholder backlash. But Mark is uniquely placed to withstand these pressures given his clout over the board voting powers. Like Augustus Caesar his role model, Zuckerberg has a chance to make history. But he might have to put the company through hard and sacrificing times in exchange for the proverbial 200 years of world peace. He’s got the best minds and limitless resources at his disposal to right what he and his platform wronged. But he would have to make enemies with the hands that feed him. Would he rise to the challenge? Like Augustus who is rumored to have killed his grandson, will Zuckerberg ever be prepared to kill his ad revenue generating brainchild? In the meanwhile, we must not underestimate the power of good digital citizenry. We must continue to fight the good fight to move tech giants like Facebook in the right direction. Just as persistent trickling water droplets can erode mountains and create new pathways, so can our mindful actions as digital platform users prompt major tech reforms. It could be as bold as deleting one's Facebook account (I haven’t been on the platform for years now, and I don’t miss it at all). 
You could organize groups to create awareness on topics like digital privacy, fake news, filter bubbles, or deliberately choose to engage with those whose views differ from yours to understand their perspective on topics and thereby do your part in reversing algorithmically accentuated polarity. It could also be by selecting the right individuals to engage in informed dialog with tech conglomerates. Not every action needs to be hard though. It could be as simple as customizing your default privacy settings or choosing to only spend a select amount of time on such platforms, or deciding to verify the authenticity and assess the toxicity of a post you wish to like, share or forward to your network.

Addendum

How far did Facebook go to become a mobile-first company?

Following are some of the things Facebook did to become the largest mobile advertising platform in the world, surpassing Google by a huge margin.

Clear purpose and reason for the change: “For one, there are more mobile users. Second, they’re spending more time on it... third, we can have better advertising on mobile, make more money,” said Zuckerberg at TechCrunch Disrupt back in 2012 on why they were becoming mobile first. In other words, there was a lot of growth and revenue potential in investing in this space. This was a simple and clear ‘what’s in it for me’ incentive for everyone working to make the transition, as well as for stockholders and advertisers to place their trust in Zuckerberg’s endeavors.

Setting company-wide accountability: “We realigned the company around, so everybody was responsible for mobile,” said the then President of Business and Marketing Partnerships David Fischer to Fortune in 2013.

Willing to sacrifice desktop for mobile: Facebook decided to make a bold gamble to lose its desktop users to grow its unproven mobile platform. Essentially it was willing to bet its only cash cow for a dark horse that was dependent on so many other factors to go right.
Strict consequences for non-compliance: Back in the days of transitioning to a mobile-first company, Zuckerberg famously told all his product teams what to expect when they went in for reviews: “Come in with mobile. If you come in and try to show me a desktop product, I’m going to kick you out. You have to come in and show me a mobile product.”

Expanding resources and investing in reskilling: They grew from a team of 20 mobile engineers to literally all engineers at Facebook undergoing training courses on iOS and Android development. “We’ve completely changed the way we do product development. We’ve trained all our engineers to do mobile first,” said Facebook’s VP of corporate development, Vaughan Smith, to TechCrunch by the end of 2012.

Realigning product design philosophy: They designed custom features for the mobile-first interface instead of trying to adapt the features for the web to mobile. In other words, they began with mobile as their default user interface.

Local and global user behavior sensitization: Some of their engineering teams even did field visits to developing nations like the Philippines to see first hand how mobile apps are being used there.

Environmental considerations in app design: Facebook even had the foresight to consider scenarios where mobile users may have poor internet signals or battery related issues. They designed their apps keeping these future needs in mind.

Richard Gall

14 Sep 2018

4 min read

How Facebook is advancing artificial intelligence [Video]

Melisha Dsouza

02 Sep 2018

14 min read

Getting started with Amazon Machine Learning workflow [Tutorial]

Amazon Machine Learning is useful for building ML models and generating predictions. It also enables the development of robust and scalable smart applications. The process of building ML models with Amazon Machine Learning consists of three operations: data analysis, model training, and evaluation. The code files for this article are available on GitHub. This tutorial is an excerpt from a book written by Alexis Perrier titled Effective Amazon Machine Learning.

The Amazon Machine Learning service is available at https://console.aws.amazon.com/machinelearning/. The Amazon ML workflow closely follows a standard data science workflow with the following steps:

1. Extract the data and clean it up. Make it available to the algorithm.
2. Split the data into a training and a validation set, typically a 70/30 split with equal distribution of the predictors in each part.
3. Select the best model by training several models on the training dataset and comparing their performances on the validation dataset.
4. Use the best model for predictions on new data.

As shown in the following Amazon ML menu, the service is built around four objects: Datasource, ML model, Evaluation, and Prediction. The Datasource and Model can also be configured and set up in the same flow by creating a new Datasource and ML model. Let us take a closer look at each one of these steps.

Understanding the dataset used

We will use the simple Predicting Weight by Height and Age dataset (from Lewis and Taylor (1967)) with 237 samples of children's age, weight, height, and gender, which is available at https://v8doc.sas.com/sashtml/stat/chap55/sect51.htm. This dataset is composed of 237 rows. Each row has the following predictors: sex (F, M), age (in months), and height (in inches), and we are trying to predict the weight (in lbs) of these children. There are no missing values and no outliers. The variables are close enough in range and normalization is not required. We do not need to carry out any preprocessing or cleaning on the original dataset.
Age, height, and weight are numerical variables (real-valued), and sex is a categorical variable. We will randomly select 20% of the rows as the held-out subset to use for prediction on previously unseen data and keep the other 80% as training and evaluation data. This data split can be done in Excel or any other spreadsheet editor:

1. Create a new column with randomly generated numbers.
2. Sort the spreadsheet by that column.
3. Select 190 rows for training and 47 rows for prediction (roughly an 80/20 split).

Let us name the training set LT67_training.csv and the held-out set that we will use for prediction LT67_heldout.csv, where LT67 stands for Lewis and Taylor, the creators of this dataset in 1967. As with all datasets, scripts, and resources mentioned in this book, the training and holdout files are available in the GitHub repository at https://github.com/alexperrier/packt-aml. It is important for the distribution in age, sex, height, and weight to be similar in both subsets. We want the data on which we will make predictions to show patterns that are similar to the data on which we will train and optimize our model.

Loading the data on S3

Follow these steps to load the training and held-out datasets on S3:

1. Go to your S3 console at https://console.aws.amazon.com/s3.
2. Create a bucket if you haven't done so already. Buckets are basically folders that are uniquely named across all S3. We created a bucket named aml.packt. Since that name has now been taken, you will have to choose another bucket name if you are following along with this demonstration.
3. Click on the bucket name you created and upload both the LT67_training.csv and LT67_heldout.csv files by selecting Upload from the Actions drop-down menu.

Both files are small, only a few KB, and hosting costs should remain negligible for this exercise.
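For readers who prefer a script to a spreadsheet, the random 80/20 split described earlier can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names, the fixed seed, and the source file name are hypothetical and do not come from the book's repository.

```python
import csv
import random


def split_rows(rows, heldout_fraction=0.2, seed=42):
    """Shuffle the rows and return (training, heldout) lists."""
    shuffled = list(rows)
    # A seeded generator keeps the split reproducible between runs.
    random.Random(seed).shuffle(shuffled)
    n_heldout = round(len(shuffled) * heldout_fraction)
    return shuffled[n_heldout:], shuffled[:n_heldout]


def write_csv(path, header, rows):
    """Write a header line followed by the data rows."""
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(header)
        writer.writerows(rows)


def split_file(source_path, training_path, heldout_path):
    """Read a CSV with a header row and write the two subsets."""
    with open(source_path, newline="") as f:
        reader = csv.reader(f)
        header = next(reader)
        rows = list(reader)
    training, heldout = split_rows(rows)
    write_csv(training_path, header, training)
    write_csv(heldout_path, header, heldout)
    return len(training), len(heldout)


# Hypothetical usage; "LT67_full.csv" is an assumed name for the original file:
# split_file("LT67_full.csv", "LT67_training.csv", "LT67_heldout.csv")
```

With 237 rows and a held-out fraction of 0.2, this yields 190 training rows and 47 held-out rows, matching the counts used in this tutorial. A purely random split does not guarantee similar distributions in both subsets, so it is still worth comparing them afterwards.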
Note that for each file, by selecting the Properties tab on the right, you can specify how your files are accessed, what user, role, group or AWS service may download, read, write, and delete the files, and whether or not they should be accessible from the Open Web. When creating the datasource in Amazon ML, you will be prompted to grant Amazon ML access to your input data. You can specify the access rules to these files now in S3 or simply grant access later on.

Our data is now in the cloud in an S3 bucket. We need to tell Amazon ML where to find that input data by creating a datasource. We will first create the datasource for the training file LT67_training.csv.

Declaring a datasource

Go to the Amazon ML dashboard, and click on Create new... | Datasource and ML model. We will use the faster flow available by default.

As shown in the following screenshot, you are asked to specify the path to the LT67_training.csv file in the form {S3://bucket}{path}{file}. Note that the S3 location field automatically populates with the bucket names and file names that are available to your user.

Specifying a Datasource name is useful to organize your Amazon ML assets. By clicking on Verify, Amazon ML will make sure that it has the proper rights to access the file. In case it needs to be granted access to the file, you will be prompted to do so as shown in the following screenshot. Just click on Yes to grant access. At this point, Amazon ML will validate the datasource and analyze its contents.

Creating the datasource

An Amazon ML datasource is composed of the following:

1. The location of the data file. The data file is not duplicated or cloned in Amazon ML but accessed from S3.
2. The schema that contains information on the type of the variables contained in the CSV file: Categorical, Text, Numeric (real-valued), or Binary.

It is possible to supply Amazon ML with your own schema or modify the one created by Amazon ML.
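To give an idea of what a hand-written schema might look like, here is a minimal sketch that builds one as a Python dictionary and serializes it to JSON. The column names are assumed from the dataset description (the actual header in the CSV may differ), and the field names follow the Amazon ML schema format as best I recall it, so check them against the official documentation before relying on them.

```python
import json


def build_schema(target="weight"):
    """Build a schema dictionary for the children's weight dataset.

    Column names and their order are assumptions based on the
    dataset description, not read from the actual CSV file.
    """
    attributes = [
        ("sex", "CATEGORICAL"),
        ("age", "NUMERIC"),
        ("height", "NUMERIC"),
        ("weight", "NUMERIC"),
    ]
    return {
        "version": "1.0",
        "targetAttributeName": target,
        "dataFormat": "CSV",
        # The first line of the file holds the column names.
        "dataFileContainsHeader": True,
        "attributes": [
            {"attributeName": name, "attributeType": kind}
            for name, kind in attributes
        ],
    }


# Serialized form, ready to be saved next to the data file.
schema_json = json.dumps(build_schema(), indent=2)
```

Writing the schema yourself is mostly useful when the inferred types are wrong, for example when a numeric code should really be treated as a category.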
At this point, Amazon ML has a pretty good idea of the type of data in your training dataset. It has identified the different types of variables and knows how many rows it has. Move on to the next step by clicking on Continue, and see what schema Amazon ML has inferred from the dataset, as shown in the next screenshot. Amazon ML needs to know at that point which variable you are trying to predict. Be sure to tell Amazon ML that the first line in the CSV file contains the column names and that the target is the weight.

We see here that Amazon ML has correctly inferred that sex is categorical and that age, height, and weight are numeric (continuous real values).

Since we chose a numeric variable as the target, Amazon ML will use Linear Regression as the predictive model. For binary or categorical values, we would have used Logistic Regression. This means that Amazon ML will try to find the best a, b, and c coefficients so that the weight predicted by the following equation is as close as possible to the observed real weight present in the data:

predicted weight = a * age + b * height + c * sex

Amazon ML will then ask you if your data contains a row identifier. In our present case, it does not. Row identifiers are useful when you want to understand the prediction obtained for each row or add an extra column to your dataset later on in your project. Row identifiers are for reference purposes only and are not used by the service to build the model.

You will be asked to review the datasource. You can go back to each one of the previous steps and edit the parameters for the schema, the target and the input data. Now that the data is known to Amazon ML, the next step is to set up the parameters of the algorithm that will train the model.

Understanding the model

We select the default parameters for the training and evaluation settings.
Amazon ML will do the following:

1. Create a recipe for data transformation based on the statistical properties it has inferred from the dataset.
2. Split the dataset (LT67_training.csv) into a training part and a validation part, with a 70/30 split. The split strategy assumes the data has already been shuffled and can be split sequentially.

The recipe will be used to transform the data in a similar way for the training and the validation datasets. The only transformation suggested by Amazon ML is to transform the categorical variable sex into a binary variable, where m = 0 and f = 1 for instance. No other transformation is needed.

The default advanced settings for the model are shown in the following screenshot. We see that Amazon ML will pass over the data 10 times, shuffling the data each time. It will use an L2 regularization strategy based on the sum of the squares of the coefficients of the regression to prevent overfitting. Regularization comes in 3 levels with a mild (10^-6), medium (10^-4), or aggressive (10^-2) setting, each value stronger than the previous one. The default setting is mild, the lowest, with a regularization constant of 0.000001 (10^-6), implying that Amazon ML does not anticipate much overfitting on this dataset. This makes sense when the number of predictors, three in our case, is much smaller than the number of samples (190 for the training set). We will evaluate the predictive power of the model using our LT67_heldout.csv dataset later on.

Clicking on the Create ML model button will launch the model creation. This takes a few minutes to complete, depending on the size and complexity of your dataset. You can check its status by refreshing the model page. In the meantime, the model status remains pending.

At that point, Amazon ML will split our training dataset into two subsets: a training and a validation set.
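As an aside, a sequential split of this kind can be sketched as follows. This is an illustration rather than the service's actual implementation: the function and parameter names are hypothetical, and the exact rounding used by Amazon ML at the boundary is an assumption.

```python
def sequential_split(rows, percent_begin=0, percent_end=70):
    """Return (selected, complement) for a percentage window over rows.

    Rows are taken in their existing order, so the data should
    already be shuffled before calling this function.
    """
    n = len(rows)
    # Integer arithmetic avoids floating-point surprises at the edges.
    start = n * percent_begin // 100
    end = n * percent_end // 100
    selected = rows[start:end]
    complement = rows[:start] + rows[end:]
    return selected, complement


# With the 190-row training file, the default window keeps the
# first 133 rows for training and leaves 57 rows for validation.
```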
It will use the training portion of the data to train several settings of the algorithm and select the best one based on its performance on the training data. It will then apply the associated model to the validation set and return an evaluation score for that model. By default, Amazon ML will sequentially take the first 70% of the samples for training and the remaining 30% for validation. It's worth noting that Amazon ML will not create two extra files and store them on S3, but instead create two new datasources out of the initial datasource we have previously defined. Each new datasource is obtained from the original one via a Data rearrangement JSON recipe such as the following: { "splitting": { "percentBegin": 0, "percentEnd": 70 } } You can see these two new datasources in the Datasource dashboard. Three datasources are now available where there was initially only one, as shown by the following screenshot. While the model is being trained, Amazon ML runs the Stochastic Gradient Descent (SGD) algorithm several times on the training data with different parameters: Varying the learning rate in increments of powers of 10: 0.01, 0.1, 1, 10, and 100. Making several passes over the training data while shuffling the samples before each pass. At each pass, calculating the prediction error, the Root Mean Squared Error (RMSE), to estimate how much of an improvement over the last pass was obtained. If the decrease in RMSE is not really significant, the algorithm is considered to have converged, and no further pass shall be made. At the end of the passes, the setting that ends up with the lowest RMSE wins, and the associated model (the weights of the regression) is selected as the best version. Once the model has finished training, Amazon ML evaluates its performance on the validation datasource. Once the evaluation itself is also ready, you have access to the model's evaluation.

Evaluating the model

Amazon ML uses the standard metric RMSE for linear regression.
RMSE is defined as the square root of the mean of the squared differences between the real values and the predicted values: RMSE = sqrt((1/n) * Σ (ŷ_i - y_i)^2). Here, ŷ is the predicted values, y the real values we want to predict (the weight of the children in our case), and n the number of samples. The closer the predictions are to the real values, the lower the RMSE is. A lower RMSE means a better, more accurate prediction.

Making batch predictions

We now have a model that has been properly trained and selected among other models. We can use it to make predictions on new data. A batch prediction consists of applying a model to a datasource in order to make predictions on that datasource. We need to tell Amazon ML which model we want to apply to which data. Batch predictions are different from streaming predictions. With batch predictions, all the data is already made available as a datasource, while for streaming predictions, the data will be fed to the model as it becomes available. The dataset is not available beforehand in its entirety. In the Main Menu, select Batch Predictions to access the predictions dashboard and click on Create a New Prediction. The first step is to select one of the models available in your model dashboard. You should choose the one that has the lowest RMSE. The next step is to associate a datasource with the model you just selected. We had uploaded the held-out dataset to S3 at the beginning of this chapter (under the Loading the data on S3 section) but had not used it to create a datasource. We will do so now. When asked for a datasource in the next screen, make sure to check My data is in S3, and I need to create a datasource, and then select the held-out dataset that should already be present in your S3 bucket. Don't forget to tell Amazon ML that the first line of the file contains the column names. In our current project, our held-out dataset also contains the true values for the weight of the students. This would not be the case for "real" data in a real-world project where the real values are truly unknown.
However, in our case, this will allow us to calculate the RMSE score of our predictions and assess the quality of these predictions. The final step is to click on the Verify button and wait for a few minutes. Amazon ML will run the model on the new datasource and will generate predictions in the form of a CSV file. Contrary to the evaluation and model-building phase, we now have real predictions. We are also no longer given a score associated with these predictions. After a few minutes, you will notice a new batch-prediction folder in your S3 bucket. This folder contains a manifest file and a results folder. The manifest file is a JSON file with the path to the initial datasource and the path to the results file. The results folder contains a gzipped CSV file. Uncompressed, the CSV file contains two columns: trueLabel, the initial target from the held-out set, and score, which corresponds to the predicted values. We can easily calculate the RMSE for those results directly in the spreadsheet through the following steps: Creating a new column that holds the square of the difference of the two columns. Averaging that column over all the rows. Taking the square root of the result. The following illustration shows how we create a third column C, as the squared difference between the trueLabel column A and the score (or predicted value) column B. As shown in the following screenshot, averaging column C and taking the square root gives an RMSE of 11.96, which is significantly better than the RMSE we obtained during the evaluation phase (RMSE 14.4). The fact that the RMSE on the held-out set is better than the RMSE on the validation set suggests that our model did not overfit the training data, since it performed even better on new data than expected. Our model is robust. The left side of the following graph shows the True (Triangle) and Predicted (Circle) Weight values for all the samples in the held-out set. The right side shows the histogram of the residuals.
Similar to the histogram of residuals we had observed on the validation set, we observe that the residuals are not centered on 0. Our model has a tendency to overestimate the weight of the students.

In this tutorial, we loaded the data onto S3 and let Amazon ML infer the schema and transform the data. We also created a model and evaluated its performance. Finally, we made a prediction on the held-out dataset. To understand how to leverage Amazon's powerful platform for your predictive analytics needs, check out the book Effective Amazon Machine Learning.

Four interesting Amazon patents in 2018 that use machine learning, AR, and robotics
Amazon Sagemaker makes machine learning on the cloud easy
Amazon ML Solutions Lab to help customers “work backwards” and leverage machine learning
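The spreadsheet calculation of the RMSE described in the tutorial above (square the difference of the two columns, average, take the square root) can also be scripted. The following is a minimal Python sketch, assuming the uncompressed results file has the trueLabel and score columns mentioned in the tutorial. The function names `rmse` and `rmse_from_csv` and the three sample rows are hypothetical, made up for illustration, and are not taken from the tutorial's actual results.

```python
import csv
import io
import math


def rmse(true_values, predicted_values):
    """Square root of the mean squared difference between two equal-length sequences."""
    if len(true_values) != len(predicted_values) or not true_values:
        raise ValueError("inputs must be non-empty and of equal length")
    squared_errors = [(t - p) ** 2 for t, p in zip(true_values, predicted_values)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


def rmse_from_csv(text):
    """Read the trueLabel and score columns from CSV text and return the RMSE."""
    rows = list(csv.DictReader(io.StringIO(text)))
    y_true = [float(row["trueLabel"]) for row in rows]
    y_pred = [float(row["score"]) for row in rows]
    return rmse(y_true, y_pred)


# Made-up rows for illustration only (hypothetical values, not real predictions).
sample = "trueLabel,score\n100,110\n80,74\n120,120\n"
print(rmse_from_csv(sample))
```

To use this on a real results file, read the uncompressed CSV into a string (or adapt `rmse_from_csv` to take a file object) and compare the result with the RMSE reported during the evaluation phase.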

Natasha Mathur

24 Aug 2018

11 min read

Understanding Amazon Machine Learning Workflow [ Tutorial ]


This article presents an overview of the workflow of a simple Amazon Machine Learning (Amazon ML) project. Amazon Machine Learning is an online service by Amazon Web Services (AWS) that does supervised learning for predictive analytics. Launched in April 2015 at the AWS Summit, Amazon ML joins a growing list of cloud-based machine learning services, such as Microsoft Azure, Google Prediction, IBM Watson, Prediction IO, BigML, and many others. These online machine learning services form an offer commonly referred to as Machine Learning as a Service or MLaaS, following a similar denomination pattern of other cloud-based services such as SaaS, PaaS, and IaaS, respectively for Software, Platform, or Infrastructure as a Service. The Amazon ML workflow closely follows a standard Data Science workflow with the following steps: Extract the data and clean it up. Make it available to the algorithm. Split the data into a training and validation set, typically a 70/30 split with equal distribution of the predictors in each part. Select the best model by training several models on the training dataset and comparing their performances on the validation dataset. Use the best model for predictions on new data. This article is an excerpt taken from the book 'Effective Amazon Machine Learning' written by Alexis Perrier. As shown in the following Amazon ML menu, the service is built around four objects: Datasource, ML model, Evaluation, and Prediction. The Datasource and ML model can also be configured and set up in the same flow by creating a new Datasource and ML model. We will take a closer look at the Datasource and ML model.

Amazon ML dataset

For the rest of the article, we will use the simple Predicting Weight by Height and Age dataset (from Lewis and Taylor (1967)) with 237 samples of children's age, weight, height, and gender, which is available at https://v8doc.sas.com/sashtml/stat/chap55/sect51.htm. This dataset is composed of 237 rows.
Each row has the following predictors: sex (F, M), age (in months), height (in inches), and we are trying to predict the weight (in lbs) of these children. There are no missing values and no outliers. The variables are close enough in range and normalization is not required. In short, we do not need to carry out any preprocessing or cleaning on the original dataset. Age, height, and weight are numerical variables (real-valued), and sex is a categorical variable. We will randomly select 20% of the rows as the held-out subset to use for the prediction of previously unseen data and keep the other 80% as training and evaluation data. This data split can be done in Excel or any other spreadsheet editor by creating a new column with randomly generated numbers, sorting the spreadsheet by that column, and selecting 190 rows for training and 47 rows for prediction (roughly an 80/20 split). Let us name the training set LT67_training.csv and the held-out set that we will use for prediction LT67_heldout.csv, where LT67 stands for Lewis and Taylor, the creators of this dataset in 1967. Note that it is important for the distribution in age, sex, height, and weight to be similar in both subsets. We want the data on which we will make predictions to show patterns that are similar to the data on which we will train and optimize our model.

Loading the data on Amazon S3

Follow these steps to load the training and held-out datasets on S3: Go to your S3 console at https://console.aws.amazon.com/s3. Create a bucket if you haven't done so already. Buckets are basically folders that are uniquely named across all S3. We created a bucket named aml.packt. Since that name has now been taken, you will have to choose another bucket name if you are following along with this demonstration.
Click on the bucket name you created and upload both the LT67_training.csv and LT67_heldout.csv files by selecting Upload from the Actions drop-down menu. Both files are small, only a few KB, and hosting costs should remain negligible for this exercise. Note that for each file, by selecting the Properties tab on the right, you can specify how your files are accessed, what user, role, group, or AWS service may download, read, write, and delete the files, and whether or not they should be accessible from the open web. When creating the datasource in Amazon ML, you will be prompted to grant Amazon ML access to your input data. You can specify the access rules to these files now in S3 or simply grant access later on. Our data is now in the cloud in an S3 bucket. We need to tell Amazon ML where to find that input data by creating a datasource. We will first create the datasource for the training file LT67_training.csv.

Declaring a datasource

Go to the Amazon ML dashboard, and click on Create new... | Datasource and ML model. We will use the faster flow available by default. As shown in the following screenshot, you are asked to specify the path to the LT67_training.csv file {S3://bucket}{path}{file}. Note that the S3 location field automatically populates with the bucket names and file names that are available to your user. Specifying a Datasource name helps organize your Amazon ML assets. By clicking on Verify, Amazon ML will make sure that it has the proper rights to access the file. In case it needs to be granted access to the file, you will be prompted to do so as shown in the following screenshot. Just click on Yes to grant access. At this point, Amazon ML will validate the datasource and analyze its contents.
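The held-out split prepared before the upload (a random column, a sort on that column, then 190 rows for training and 47 for prediction) does not have to be done in a spreadsheet. The following is a minimal Python sketch of the same idea, assuming the 237 rows are already loaded into a list. The function name `split_heldout`, the fixed seed, and the stand-in rows are hypothetical choices made for illustration.

```python
import random


def split_heldout(rows, n_train=190, seed=42):
    """Shuffle a copy of the rows and split it into training and held-out parts.

    Shuffling plays the role of sorting the spreadsheet by a column of random numbers.
    """
    shuffled = list(rows)                  # leave the caller's list untouched
    random.Random(seed).shuffle(shuffled)  # fixed seed so the split is reproducible
    return shuffled[:n_train], shuffled[n_train:]


# Stand-in for the 237 dataset rows (hypothetical placeholder values).
rows = list(range(237))
training, heldout = split_heldout(rows)
print(len(training), len(heldout))
```

Each part would then be written out as LT67_training.csv and LT67_heldout.csv. As the article notes, it is worth checking that age, sex, height, and weight are similarly distributed in both parts after the split.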
Creating the datasource

An Amazon ML datasource is composed of the following: the location of the data file (the data file is not duplicated or cloned in Amazon ML but accessed from S3), and the schema that contains information on the type of the variables contained in the CSV file: Categorical, Text, Numeric (real-valued), or Binary. It is possible to supply Amazon ML with your own schema or modify the one created by Amazon ML. At this point, Amazon ML has a pretty good idea of the type of data in your training dataset. It has identified the different types of variables and knows how many rows the dataset has. Move on to the next step by clicking on Continue, and see what schema Amazon ML has inferred from the dataset, as shown in the next screenshot. Amazon ML needs to know at that point which variable you are trying to predict. Be sure to tell Amazon ML the following: the first line in the CSV file contains the column names, and the target is the weight. We see here that Amazon ML has correctly inferred the following: sex is categorical, while age, height, and weight are numeric (continuous real values). Since we chose a numeric variable as the target, Amazon ML will use Linear Regression as the predictive model. For binary or categorical values, we would have used Logistic Regression. This means that Amazon ML will try to find the best a, b, and c coefficients so that the weight predicted by the following equation is as close as possible to the observed real weight present in the data: predicted weight = a * age + b * height + c * sex. Amazon ML will then ask you if your data contains a row identifier. In our present case, it does not. Row identifiers are used when you want to understand the prediction obtained for each row or add an extra column to your dataset later on in your project. Row identifiers are for reference purposes only and are not used by the service to build the model. You will be asked to review the datasource.
You can go back to each one of the previous steps and edit the parameters for the schema, the target, and the input data. Now that the data is known to Amazon ML, the next step is to set up the parameters of the algorithm that will train the model.

The machine learning model

We select the default parameters for the training and evaluation settings. Amazon ML will do the following: create a step for data transformation based on the statistical properties it has inferred from the dataset, and split the dataset (LT67_training.csv) into a training part and a validation part, with a 70/30 split. The split strategy assumes the data has already been shuffled and can be split sequentially. The step will be used to transform the data in a similar way for the training and the validation datasets. The only transformation suggested by Amazon ML is to transform the categorical variable sex into a binary variable, where m = 0 and f = 1 for instance. No other transformation is needed. The default advanced settings for the model are shown in the following screenshot. We see that Amazon ML will pass over the data 10 times, shuffling the data each time. It will use an L2 regularization strategy based on the sum of the squares of the coefficients of the regression to prevent overfitting. We will evaluate the predictive power of the model using our LT67_heldout.csv dataset later on. Regularization comes in 3 levels with a mild (10^-6), medium (10^-4), or aggressive (10^-2) setting, each value stronger than the previous one. The default setting is mild, the lowest, with a regularization constant of 0.000001 (10^-6), implying that Amazon ML does not anticipate much overfitting on this dataset. This makes sense when the number of predictors, three in our case, is much smaller than the number of samples (190 for the training set). Clicking on the Create ML model button will launch the model creation.
This takes a few minutes to resolve, depending on the size and complexity of your dataset. You can check its status by refreshing the model page. In the meantime, the model status remains pending. At that point, Amazon ML will split our training dataset into two subsets: a training and a validation set. It will use the training portion of the data to train several settings of the algorithm and select the best one based on its performance on the training data. It will then apply the associated model to the validation set and return an evaluation score for that model. By default, Amazon ML will sequentially take the first 70% of the samples for training and the remaining 30% for validation. It's worth noting that Amazon ML will not create two extra files and store them on S3, but instead create two new datasources out of the initial datasource we have previously defined. Each new datasource is obtained from the original one via a Data rearrangement JSON recipe such as the following: { "splitting": { "percentBegin": 0, "percentEnd": 70 } } You can see these two new datasources in the Datasource dashboard. Three datasources are now available where there was initially only one, as shown by the following screenshot. While the model is being trained, Amazon ML runs the Stochastic Gradient Descent (SGD) algorithm several times on the training data with different parameters: Varying the learning rate in increments of powers of 10: 0.01, 0.1, 1, 10, and 100. Making several passes over the training data while shuffling the samples before each pass. At each pass, calculating the prediction error, the Root Mean Squared Error (RMSE), to estimate how much of an improvement over the last pass was obtained. If the decrease in RMSE is not really significant, the algorithm is considered to have converged, and no further pass shall be made.
At the end of the passes, the setting that ends up with the lowest RMSE wins, and the associated model (the weights of the regression) is selected as the best version. Once the model has finished training, Amazon ML evaluates its performance on the validation datasource. Once the evaluation itself is also ready, you have access to the model's evaluation. The Amazon ML flow is smooth and facilitates the inherent data science loop: data, model, evaluation, and prediction. We looked at an overview of the workflow of a simple Amazon Machine Learning (Amazon ML) project. We discussed two objects of the Amazon ML menu: Datasource and ML model. If you found this post useful, be sure to check out the book 'Effective Amazon Machine Learning' to learn about evaluation and prediction in Amazon ML along with other AWS ML concepts. Integrate applications with AWS services: Amazon DynamoDB & Amazon Kinesis [Tutorial] AWS makes Amazon Rekognition, its image recognition AI, available for Asia-Pacific developers
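To make the sequential 70/30 split from the article concrete, the sketch below applies a percentBegin/percentEnd range to an in-memory list in Python. It only illustrates the idea behind the Data rearrangement recipe and is not Amazon ML's implementation. The function name `apply_splitting` is hypothetical, the boundary rounding (integer floor) is an assumption, and the validation range of 70 to 100 is inferred from the article's statement that the remaining 30% is used for validation.

```python
def apply_splitting(rows, percent_begin, percent_end):
    """Return the slice of rows between two percentages of the dataset, in order.

    Boundaries are floored to whole rows; the rounding Amazon ML itself uses is
    not stated in the article, so this is an assumption.
    """
    if not 0 <= percent_begin <= percent_end <= 100:
        raise ValueError("expected 0 <= percent_begin <= percent_end <= 100")
    n = len(rows)
    start = n * percent_begin // 100
    end = n * percent_end // 100
    return rows[start:end]


# Stand-in for the 190 rows of the training file (hypothetical placeholder values).
rows = list(range(190))
train_part = apply_splitting(rows, 0, 70)         # first 70% of the rows
validation_part = apply_splitting(rows, 70, 100)  # remaining 30% (inferred range)
print(len(train_part), len(validation_part))
```

Because the split is sequential, it only yields comparable training and validation parts if the rows were shuffled beforehand, which is why the article stresses that the split strategy assumes already shuffled data.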

Natasha Mathur

11 Aug 2018

10 min read

Four IBM facial recognition patents in 2018, we found intriguing

Sunith Shetty

10 Aug 2018

11 min read

Time series modeling: What is it, Why it matters and How it's used


A series can be defined as a number of events, objects, or people of a similar or related kind coming one after another; if we add the dimension of time, we get a time series. A time series can be defined as a series of data points in time order. In this article, we will understand what time series is and why it is one of the essential characteristics for forecasting. This article is an excerpt from a book written by Harish Gulati titled SAS for Finance.

The importance of time series

What importance, if any, does time series have and how will it be relevant in the future? These are just a couple of fundamental questions that any user should find answers to before delving further into the subject. Let's try to answer this by posing a question. Have you heard the terms big data, artificial intelligence (AI), and machine learning (ML)? These three terms make learning time series analysis relevant. Big data is primarily about a large amount of data that may be analyzed computationally to reveal patterns, trends, and associations, especially relating to human behavior and interaction. AI is a kind of technology that is being developed by data scientists, computational experts, and others to enable processes to become more intelligent, while ML is an enabler that is helping to implement AI. All three of these terms are interlinked with the data they use, and a lot of this data is time series in its nature. This could be either financial transaction data, the behavior pattern of individuals during various parts of the day, or related to life events that we might experience. An effective mechanism that enables us to capture the data, store it, analyze it, and then build algorithms to predict transactions, behavior (and life events, in this instance) will depend on how big data is utilized and how AI and ML are leveraged. A common perception in the industry is that time series data is used for forecasting only.
In practice, time series data is used for pattern recognition, forecasting, benchmarking, evaluating the influence of a single factor on the time series, and quality control. For example, a retailer may identify a pattern in clothing sales every time it gets a celebrity endorsement, or an analyst may decide to use car sales volume data from 2012 to 2017 to set a selling benchmark in units. An analyst might also build a model to quantify the effect of Lehman's crash at the height of the 2008 financial crisis in pushing up the price of gold. Variance in the success of treatments across time periods can also be used to highlight a problem, the tracking of which may enable a hospital to take remedial measures. These are just some of the examples that showcase how time series analysis isn't limited to just forecasting. In this chapter, we will review how the financial industry and others use forecasting, discuss what a good and a bad forecast is, and hope to understand the characteristics of time series data and its associated problems.

Forecasting across industries

Since one of the primary uses of time series data is forecasting, it's wise that we learn about some of its fundamental properties. To understand what the industry means by forecasting and the steps involved, let's visit a common misconception about the financial industry: only lending activities require forecasting. We need forecasting in order to grant personal loans, mortgages, overdrafts, or simply assess someone's eligibility for a credit card, as the industry uses forecasting to assess a borrower's affordability and their willingness to repay the debt. Even deposit products such as savings accounts, fixed-term savings, and bonds are priced based on some forecasts. How we forecast and the rationale for that methodology is different in borrowing or lending cases, however.
All of these areas are related to time series, as we inevitably end up using time series data as part of the overall analysis that drives financial decisions. Let's understand the forecasts involved here a bit better. When we are assessing an individual's lending needs and limits, we are forecasting for a single person yet comparing the individual to a pool of good and bad customers who have been offered similar products. We are also assessing the individual's financial circumstances and behavior through industry-available scoring models or by assessing their past behavior, with the financial provider assessing the lending criteria. In the case of deposit products, as long as the customer is eligible to transact (can open an account and has passed know your customer (KYC), anti-money laundering (AML), and other checks), financial institutions don't perform forecasting at an individual level. However, the behavior of a particular customer is primarily driven by the interest rate offered by the financial institution. The interest rate, in turn, is driven by the forecasts the financial institution has done to assess its overall treasury position. The treasury is the department that manages the central bank's money and has the responsibility of ensuring that all departments are funded, which is generated through lending and attracting deposits at a lower rate than a bank lends. The treasury forecasts its requirements for lending and deposits, while various teams within the treasury adhere to those limits. Therefore, a pricing manager for a deposit product will price the product in such a way that the product will attract enough deposits to meet the forecasted targets shared by the treasury; the pricing manager also has to ensure that those targets aren't overshot by a significant margin, as the treasury only expects to manage a forecasted target. In both lending and deposit decisions, financial institutions do tend to use forecasting. 
A lot of these forecasts are interlinked, as we saw in the example of the treasury's expectations and the subsequent pricing decision for a deposit product. To decide on its future lending and borrowing positions, the treasury must have used time series data to determine what the potential business appetite for lending and borrowing in the market is and would have assessed that with the current cash flow situation within the relevant teams and institutions.

Characteristics of time series data

Any time series analysis has to take into account the following factors: seasonality, trend, outliers and rare events, and disruptions and step changes.

Seasonality

Seasonality is a phenomenon that occurs each calendar year. The same behavior can be observed each year. A good forecasting model will be able to incorporate the effect of seasonality in its forecasts. Christmas is a great example of seasonality, where retailers have come to expect higher sales over the festive period. Seasonality can extend into months but is usually only observed over days or weeks. When looking at time series where the periodicity is hours, you may find a seasonality effect for certain hours of the day. Some of the reasons for seasonality include holidays, climate, and changes in social habits. For example, travel companies usually run far fewer services on Christmas Day, citing a lack of demand. During most holidays people love to travel, but this lack of demand on Christmas Day could be attributed to social habits, where people tend to stay at home or have already traveled. Social habit becomes a driving factor in the seasonality of journeys undertaken on Christmas Day. It's easier for the forecaster when a particular seasonal event occurs on a fixed calendar date each year; the issue comes when some popular holidays depend on lunar movements, such as Easter, Diwali, and Eid. These holidays may occur in different weeks or months over the years, which will shift the seasonality effect.
Also, if some holidays fall closer to other holiday periods, it may lead to individuals taking extended holidays and travel sales may increase more than expected in such years. The coffee shop near the office may also experience lower sales for a longer period. Changes in the weather can also impact seasonality; for example, a longer, warmer summer may be welcome in the UK, but this would impact retail sales in the autumn as most shoppers wouldn't need to buy a new wardrobe. In hotter countries, sales of air-conditioners would increase substantially compared to the summer months' usual seasonality. Forecasters could offset this unpredictability in seasonality by building in a weather forecast variable. We will explore similar challenges in the chapters ahead. Seasonality shouldn't be confused with a cyclic effect. A cyclic effect is observed over a longer period of generally two years or more. The property sector is often associated with having a cyclic effect, where it has long periods of growth or slowdown before the cycle continues.

Trend

A trend is merely a long-term direction of observed behavior that is found by plotting data against a time component. A trend may indicate an increase or decrease in behavior. Trends may not even be linear, but a broad movement can be identified by analyzing plotted data.

Outliers and rare events

Outliers and rare events are terminologies that are often used interchangeably by businesses. These concepts can have a big impact on data, and some sort of outlier treatment is usually applied to data before it is used for modeling. It is almost impossible to predict an outlier or rare event but they do affect a trend. An example of an outlier could be a customer walking into a branch to deposit an amount that is 100 times the daily average of that branch. In this case, the forecaster wouldn't expect that trend to continue.

Disruptions

Disruptions and step changes are becoming more common in time series data.
One reason for this is the abundance of available data and the growing ability to store and analyze it. Disruptions could include instances when a business hasn't been able to trade as normal. Flooding at the local pub may lead to reduced sales for a few days, for example. While analyzing daily sales across a pub chain, an analyst may have to make note of a disruptive event and its impact on the chain's revenue. Step changes are also more common now due to technological shifts, mergers and acquisitions, and business process re-engineering. When two companies announce a merger, they often try to sync their data. They might have been selling x and y quantities individually, but after the merger will expect to sell x + y + c (where c is the positive or negative effect of the merger). Over time, when someone plots sales data in this case, they will probably spot a step change in sales that happened around the time of the merger, as shown in the following screenshot: In the trend graph, we can see that online travel bookings are increasing. In the step change and disruptions chart, we can see that Q1 of 2012 saw a substantive increase in bookings, where Q1 of 2014 saw a substantive dip. The increase was due to the merger of two companies that took place in Q1 of 2012. The decrease in Q1 of 2014 was attributed to prolonged snow storms in Europe and the ash cloud disruption from volcanic activity over Iceland. While online bookings kept increasing after the step change, the disruption caused by the snow storm and ash cloud only had an effect on sales in Q1 of 2014. In this case, the modeler will have to treat the merger and the disruption differently while using them in the forecast, as disruption could be disregarded as an outlier and treated accordingly. Also note that the seasonality chart shows that Q4 of each year sees almost a 20% increase in travel bookings, and this pattern continues each calendar year. 
In this article, we defined time series and learned why it is important for forecasting. We also looked at the characteristics of time series data. To learn more about how to leverage the analytical power of SAS to perform financial analysis efficiently, you can check out the book SAS for Finance.

Read more
Getting to know SQL Server options for disaster recovery
Implementing a simple Time Series Data Analysis in R
Training RNNs for Time Series Forecasting

Bhagyashree R

08 Aug 2018

3 min read

Diffractive Deep Neural Network (D2NN): UCLA-developed AI device can identify objects at the speed of light


Researchers at the University of California, Los Angeles (UCLA) have developed a 3D-printed all-optical deep learning architecture called Diffractive Deep Neural Network (D2NN). D2NN is a deep learning neural network physically formed by multiple layers of diffractive surfaces that work in collaboration to optically perform an arbitrary function. While the inference/prediction of the physical network is all-optical, the learning part that leads to its design is done through a computer.

How does D2NN work?

A computer-simulated design was created first; the researchers then used a 3D printer to create very thin polymer wafers. The uneven surface of the wafers helps diffract light coming from the object in different directions. The layers are composed of tens of thousands of artificial neurons, tiny pixels through which the light travels. Together, these layers form an “optical network” that shapes how incoming light travels through them. The network is able to identify an object because the light coming from the object is diffracted mostly toward a single pixel that is assigned to that type of object. The design was trained on a computer to identify the objects in front of it by learning the pattern of diffracted light each object produces as the light from that object passes through the device.

What are its advantages?

Scalable: It can easily be scaled up using numerous high-throughput and large-area 3D fabrication methods, such as soft-lithography and additive manufacturing, together with wide-field optical components and detection systems.
Easily reconfigurable: D2NN can be easily improved by adding 3D-printed layers or replacing some of the existing layers with newly trained ones.
Lightning speed: Once the device is trained, it works at the speed of light.
Efficient: No energy is consumed to run the device.
Cost-effective: The device can be reproduced for less than $50.

What are the areas it can be used in?

Image analysis
Feature detection
Object classification
New microscope or camera designs that can perform unique imaging tasks

This new AI device could find applications in medical technologies, data-intensive tasks, robotics, security, or any application where image and video data are essential. For more detail, refer to UCLA’s official news article and the paper All-optical machine learning using diffractive deep neural networks.
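The readout step described above, where light is diffracted mostly toward the single pixel assigned to an object type, amounts to picking the brightest detector. The following Python sketch only illustrates that final step; the function name, class names, and intensity values are hypothetical, and it does not model the optics or the training.

```python
def read_out_class(detector_intensities, class_names):
    """Return the class whose assigned detector pixel received the most light."""
    if not class_names or len(detector_intensities) != len(class_names):
        raise ValueError("need exactly one intensity reading per class")
    brightest = max(range(len(class_names)), key=lambda i: detector_intensities[i])
    return class_names[brightest]


# Made-up readings: most of the light lands on the third detector.
readings = [0.04, 0.07, 0.71, 0.05]
names = ["shirt", "sneaker", "sandal", "bag"]
print(read_out_class(readings, names))
```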


Packt Editorial Staff

30 Jul 2018

11 min read

What leaders at successful agile Enterprises share in common


Adopting agile ways of working is easier said than done. Firms like Barclays, C.H. Robinson, Ericsson, Microsoft, and Spotify are considered agile enterprises and operate entrepreneurially on a large scale. Do you think the leadership of these firms has something in common? Let us take a look in this article.

The leadership of a firm has a very high bearing on the extent of Enterprise Agility which the company can achieve. Leaders are in a position to influence just about every aspect of a business, including vision, mission, strategy, structure, governance, processes, and more importantly, the culture of the enterprise and the mindset of the employees. This article is an extract from the book Enterprise Agility, written by Sunil Mundra. In this article we’ll explore the personal traits of leaders that are critical for Enterprise Agility.

Personal traits are by definition intrinsic in nature. They enable the personal development of an individual and are also enablers for certain behaviors. We explore the various personal traits in detail.

#1 Willingness to expand mental models

Essentially, a mental model is an individual's perception of reality and how something works in that reality. A mental model represents one way of approaching a situation and is a form of deeply-held belief. The critical point is that a mental model represents an individual's view, which may not necessarily be true. Leaders must also consciously let go of mental models that are no longer relevant today. This is especially important for those leaders who have spent a significant part of their career leading enterprises based on mechanistic modelling, as these models will create impediments for Agility in "living" businesses. For example, using monetary rewards as a primary motivator may work for physical work, which is repetitive in nature. However, it does not work as a primary motivator for knowledge workers, for whom intrinsic motivators, namely, autonomy, mastery, and purpose, are generally more important than money. Examining the values and assumptions underlying a mental model can help in ascertaining the relevance of that model.

#2 Self-awareness

Self-awareness helps leaders to become cognizant of their strengths and weaknesses. This enables leaders to consciously focus on utilizing their strengths, and on leveraging the strengths of their peers and teams in areas where they are not strong. Leaders should validate their view of their strengths and weaknesses by regularly seeking feedback from the people they work with. According to a survey of senior executives by Cornell's School of Industrial and Labor Relations: "Leadership searches give short shrift to 'self-awareness,' which should actually be a top criterion. Interestingly, a high self-awareness score was the strongest predictor of overall success. This is not altogether surprising as executives who are aware of their weaknesses are often better able to hire subordinates who perform well in categories in which the leader lacks acumen. These leaders are also more able to entertain the idea that someone on their team may have an idea that is even better than their own." Self-awareness, a mostly underrated trait, is a huge enabler for enhancing other personal traits.

#3 Creativity

Since emergence is a primary property of complexity, leaders will often be challenged to deal with unprecedented circumstances emerging from within the enterprise and also in the external environment. This implies that what may have worked in the past is less likely to work in the new circumstances, and new approaches will be needed to deal with them. Hence, the ability to think creatively, that is, "out of the box," for coming up with innovative approaches and solutions is critical. The creativity of an individual will have its limitations, and hence leaders must harness the creativity of a broader group of people in the enterprise. A leader can be a huge enabler of this by ideating jointly with a group of people, and by facilitating discussions that challenge the status quo and spur teams to suggest improvements. Leaders can also encourage innovation through experimentation. With the fast pace of change in the external environment, and consequently the continuous evolution of businesses, leaders will often find themselves out of their comfort zone. Leaders will therefore have to get comfortable with being uncomfortable. It will be easier for leaders to think more creatively once they accept this new reality.

#4 Emotional intelligence

Emotional intelligence (EI), also known as emotional quotient (EQ), is defined by Wikipedia as "the capability of individuals to recognize their own emotions and those of others, discern between different feelings and label them appropriately, use emotional information to guide thinking and behavior, and manage and/or adjust emotions to adapt to environments or achieve one's goal/s". [iii] EI is made up of four core skills:

Self-awareness
Social awareness
Self-management
Relationship management

The importance of EI in people-centric enterprises, especially for leaders, cannot be overstated. While people in a company may be bound by purpose and by being a part of a team, people are inherently different from each other in terms of personality types and emotions. This can have a significant bearing on how people in a business deal with and react to circumstances, especially adverse ones. Having high EI enables leaders to understand people "from the inside." This helps leaders to build better rapport with people, thereby enabling them to bring out the best in employees and support them as needed.

#5 Courage

An innovative approach to dealing with an unprecedented circumstance will, by definition, carry some risk. The hypothesis about the appropriateness of that approach can only be validated by putting it to the test against reality. Leaders will therefore need to be courageous as they take calculated, risky bets, strike hard, and own the outcome of those bets. According to Guo Xiao, the President and CEO of ThoughtWorks, "There are many threats, and opportunities, facing businesses in this age of digital transformation: industry disruption from nimble startups, economic pressure from massive digital platforms, evolving security threats, and emerging technologies. Today's era, in which all things are possible, demands a distinct style of leadership. It calls for bold individuals who set their company's vision and charge ahead in a time of uncertainty, ambiguity, and boundless opportunity. It demands courage." Taking risks does not mean being reckless. Rather, leaders need to take calculated risks, after giving due consideration to intuition, facts, and opinions. Despite best efforts and intentions, some decisions will inevitably go wrong. Leaders must have the courage and humility to admit that a decision went wrong and own its outcomes, and not let these failures deter them from taking risks in the future.

#6 Passion for learning

Learnability is the ability to upskill, reskill, and deskill. In today's highly dynamic era, it is not what one knows, or what skills one has, that matters as much as the ability to quickly adapt to a different skill set. It is about understanding what is needed to optimize success and what skills and abilities are necessary, from a leadership perspective, to make the enterprise as a whole successful. Leaders need to shed inhibitions about being seen as "novices" while they acquire and practice new skills. The fact that leaders are willing to acquire new skills can be hugely impactful in terms of encouraging others in the enterprise to do the same. This is especially important in terms of bringing in and encouraging the culture of learnability across the business.

#7 Awareness of cognitive biases

Cognitive biases are flaws in thinking that can lead to suboptimal decisions. Leaders need to become aware of these biases so that they can objectively assess whether their decisions are being influenced by any of them. Cognitive biases lead to shortcuts in decision-making. Essentially, these biases are an attempt by the brain to simplify information processing. Leaders today are challenged with an overload of information and also the need to make decisions quickly. These factors can contribute to decisions and judgements being influenced by cognitive biases. Over decades, psychologists have discovered a huge number of biases. However, the following biases are more important from a decision-making perspective:

Confirmation bias

This is the tendency to selectively seek and hold onto information that reaffirms what you already believe to be true. For example, a leader believes that a recently launched product is doing well, based on the initial positive response. He has developed a bias that this product is successful. However, although the product is succeeding in attracting new customers, it is also losing existing customers. The confirmation bias is making the leader focus only on data pertaining to new customers, so he is ignoring data related to the loss of existing customers.

Bandwagon effect bias

Bandwagon effect bias, also known as "herd mentality," encourages doing something because others are doing it. The bias creates a feeling of not wanting to be left behind and hence can lead to irrational or badly-thought-through decisions. An enterprise launching an Agile transformation initiative without understanding the implications of the long and difficult journey ahead is an example of this bias.

"Guru" bias

Guru bias leads to blindly relying on an expert's advice. This can be detrimental, as the expert could be wrong in their assessment and therefore the advice could also be wrong. Also, the expert might give advice which primarily furthers his or her interests over the interests of the enterprise.

Projection bias

Projection bias leads a person to believe that other people have understood and are aligned with their thinking, while in reality this may not be true. This bias is more prevalent in enterprises where employees are fearful of admitting that they have not understood what their "bosses" have said, of asking questions to clarify, or of expressing disagreement.

Stability bias

Stability bias, also known as "status quo" bias, leads to a belief that change will lead to unfavorable outcomes, that is, the risk of loss is greater than the possibility of benefit. It makes a person believe that stability and predictability lead to safety. For decades, the mandate for leaders was to strive for stability and hence, many older leaders are susceptible to this bias.

Leaders must encourage others in the enterprise to challenge biases, which can uncover "blind spots" arising from them. Once decisions are made, attention should be paid to information coming from feedback.

#8 Resilience

Resilience is the capacity to quickly recover from difficulties. Given the turbulent business environment, rapidly changing priorities, and the need to take calculated risks, leaders are likely to encounter difficult and challenging situations quite often. Under such circumstances, having resilience will help the leader to "take knocks on the chin" and keep moving forward. Resilience is also about maintaining composure when something fails, analyzing the failure with the team in an objective manner, and learning from that failure. The actions of leaders are watched by the people in the enterprise even more closely in periods of crisis and difficulty, and hence leaders who show resilience go a long way toward increasing resilience across the company.

#9 Responsiveness

Responsiveness, from the perspective of leadership, is the ability to quickly grasp and respond to both challenges and opportunities. Leaders must listen to feedback coming from customers and the marketplace, learn from it, and adapt accordingly. Leaders must be ready to enable the morphing of the enterprise's offerings in order to stay relevant for customers and also to exploit opportunities. This implies that leaders must be willing to adjust the "pivot" of their offerings based on feedback. Consider, for example, the journey of Amazon Web Services, which began as an internal system but has now grown into a highly successful business. Other prominent examples are Twitter, which was an offshoot of Odeo, a website focused on sound and podcasting, and PayPal's move from transferring money via PalmPilots to becoming a highly robust online payment service.

We discovered that leaders are the primary catalysts for any enterprise aspiring to enhance its Agility. Leaders need specific capabilities, which are over and above the standard leadership capabilities, in order to take the business on the path of enhanced Enterprise Agility. These capabilities comprise personal traits and behaviors that are intrinsic in nature and enable leadership Agility, which is the foundation of Enterprise Agility. To know more about how an enterprise can thrive in a dynamic business environment, check out the book Enterprise Agility.


Savia Lobo

30 Jul 2018

12 min read

How does Elasticsearch work? [Tutorial]



Savia Lobo

29 Jul 2018

4 min read

DeepCube: A new deep reinforcement learning approach solves the Rubik’s cube with no human help


Humans have long excelled at games, indoor and outdoor alike. In recent years, however, machines using machine learning algorithms have increasingly been playing, and winning, popular board games such as Go and Chess against humans. If you think machines are only good at solving the black and whites, you are wrong. A recent example of a machine tackling a complex puzzle, the Rubik’s cube, is DeepCube. The Rubik’s cube is a challenging puzzle that has captivated many people since childhood. Solving it is a brag-worthy accomplishment for most adults. A group of UC Irvine researchers have now developed a new algorithm (used by DeepCube) known as Autodidactic Iteration, which can solve a Rubik’s cube with no human assistance.

The Erno Rubik’s cube conundrum

The Rubik’s cube, a popular three-dimensional puzzle, was developed by Erno Rubik in 1974. Rubik worked for a month to figure out the first algorithm to solve the cube. Researchers at UC Irvine state that “Since then, the Rubik’s Cube has gained worldwide popularity and many human-oriented algorithms for solving it have been discovered. These algorithms are simple to memorize and teach humans how to solve the cube in a structured, step-by-step manner.” After the cube became popular among mathematicians and computer scientists, questions around how to solve the cube in the fewest possible turns became mainstream. In 2014, it was proved that any configuration of the cube can be solved in at most 26 moves when every quarter turn counts as one move. More recently, computer scientists have tried to find ways for machines to solve the Rubik’s cube. As a first step, they tried the same approach that had succeeded in the games Go and Chess. However, this approach did not work well for the Rubik’s cube.

The approach: Rubik vs Chess and Go

Algorithms used in Go and Chess are fed the rules of the game and then play against themselves.
The deep learning machine here is rewarded based on its performance at every step it takes. The reward process is important because it helps the machine distinguish between a good and a bad move. As a result, the machine learns how to play well. In the case of the Rubik’s cube, on the other hand, rewards are hard to determine. This is because the cube is scrambled by random turns, and it is hard to judge whether a new configuration is any closer to a solution. A sequence of random turns can go on indefinitely, so earning an end-state reward is very rare. Both Chess and Go have a large search space, but each move can be evaluated and rewarded accordingly. This isn’t the case for the Rubik’s cube! UC Irvine researchers have found a way for machines to create their own set of rewards in the Autodidactic Iteration method for DeepCube.

Autodidactic Iteration: Solving the Rubik’s Cube without human Knowledge

DeepCube’s Autodidactic Iteration (ADI) is a form of deep learning known as deep reinforcement learning (DRL). It combines classic reinforcement learning, deep learning, and Monte Carlo Tree Search (MCTS). When DeepCube gets an unsolved cube, it must decide whether a specific move is an improvement on the existing configuration. To do this, it must be able to evaluate the move. Autodidactic Iteration starts with the finished cube and works backwards to find a configuration that is similar to the proposed move. Although this process is imperfect, deep learning helps the system figure out which moves are generally better than others. Researchers trained a network using ADI for 2,000,000 iterations. They further reported, “The network witnessed approximately 8 billion cubes, including repeats, and it trained for a period of 44 hours. Our training machine was a 32-core Intel Xeon E5-2620 server with three NVIDIA Titan XP GPUs.” After training, the network uses a standard search tree to hunt for suggested moves for each configuration.
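The idea of starting from the finished state and working backwards can be illustrated on a toy puzzle. The sketch below is not the authors' implementation and does not model a real Rubik’s cube: the four-tile puzzle, its three moves, the reward values (+1 for reaching the solved state, -1 otherwise), and the placeholder value function are all assumptions chosen to keep the example short.

```python
import random

SOLVED = (0, 1, 2, 3)

# A toy permutation puzzle standing in for the cube (hypothetical moves).
MOVES = {
    "rotate_left": lambda s: s[1:] + s[:1],
    "rotate_right": lambda s: s[-1:] + s[:-1],
    "swap_front": lambda s: (s[1], s[0]) + s[2:],
}


def scrambles_from_solved(depth, rng):
    """Apply random moves to the solved state and record each state with its depth."""
    state = SOLVED
    samples = []
    for d in range(1, depth + 1):
        move = rng.choice(sorted(MOVES))
        state = MOVES[move](state)
        samples.append((state, d))
    return samples


def lookahead_target(state, value_fn):
    """One-step lookahead: best (reward + estimated value) over all moves."""
    best_score, best_move = None, None
    for name in sorted(MOVES):
        child = MOVES[name](state)
        reward = 1.0 if child == SOLVED else -1.0
        score = reward + value_fn(child)
        if best_score is None or score > best_score:
            best_score, best_move = score, name
    return best_score, best_move


def untrained(state):
    """Placeholder for a learned value network: knows nothing yet."""
    return 0.0


samples = scrambles_from_solved(5, random.Random(0))
print(lookahead_target((3, 0, 1, 2), untrained))
```

Because every training state is generated by scrambling the solved state, states close to the solution are seen often, which gives the learner a usable signal even though only one state carries a positive reward. In a full system the placeholder `untrained` would be replaced by a network fitted to these lookahead targets.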
The researchers in their paper said, “Our algorithm is able to solve 100% of randomly scrambled cubes while achieving a median solve length of 30 moves, less than or equal to solvers that employ human domain knowledge.” Researchers also wrote, “DeepCube is able to teach itself how to reason in order to solve a complex environment with only one reward state using pure reinforcement learning.” Furthermore, this approach has the potential to provide approximate solutions to a broad class of combinatorial optimization problems. To explore Deep Reinforcement Learning, check out our latest releases, Hands-On Reinforcement Learning with Python and Deep Reinforcement Learning Hands-On.


Sunith Shetty

28 Jul 2018

10 min read

Creating effective dashboards using Splunk [Tutorial]


Splunk is easy to use for developing a powerful analytical dashboard with multiple panels. A dashboard with too many panels, however, will require scrolling down the page and can cause the viewer to miss crucial information. An effective dashboard should generally meet the following conditions:

Single screen view: The dashboard fits in a single window or page, with no scrolling
Multiple data points: Charts and visualizations should display a number of data points
Crucial information highlighted: The dashboard points out the most important information, using appropriate titles, labels, legends, markers, and conditional formatting as required
Created with the user in mind: Data is presented in a way that is meaningful to the user
Loads quickly: The dashboard returns results in 10 seconds or less
Avoid redundancy: The display does not repeat information in multiple places

In this tutorial, we learn to create different types of dashboards using Splunk. We will also discuss how to gather business requirements for your dashboards.

Types of Splunk dashboards

There are three kinds of dashboards typically created with Splunk:

Dynamic form-based dashboards
Real-time dashboards
Dashboards as scheduled reports

Dynamic form-based dashboards allow Splunk users to modify the dashboard data without leaving the page. This is accomplished by adding data-driven input fields (such as time, radio button, textbox, checkbox, dropdown, and so on) to the dashboard. Updating these inputs changes the data based on the selections. Dynamic form-based dashboards have existed in traditional business intelligence tools for decades now, so users who frequently use them will be familiar with changing prompt values on the fly to update the dashboard data.

Real-time dashboards are often kept on a big panel screen for constant viewing, simply because they are so useful. You see these dashboards in data centers, network operations centers (NOCs), or security operations centers (SOCs), with a constant format and data changing in real time. The dashboard will also have indicators and alerts so that operators can easily identify and act on a problem. Dashboards like this typically show the current state of security, network, or business systems, using indicators for web performance and traffic, revenue flow, login failures, and other important measures.

Dashboards as scheduled reports may not be exposed for viewing; however, the dashboard view will generally be saved as a PDF file and sent to email recipients at scheduled times. This format is ideal when you need to send information updates to multiple recipients at regular intervals, and don't want to force them to log in to Splunk to capture the information themselves.

We will create the first two types of dashboards, and you will learn how to use the Splunk dashboard editor to develop advanced visualizations along the way.

Gathering business requirements

As a Splunk administrator, one of your most important responsibilities is the data itself. As a custodian of data, a Splunk admin has significant influence over how information is interpreted and presented to users. It is common for the administrator to create the first few dashboards. A more mature implementation, however, requires collaboration to create an output that serves a variety of user requirements, and the work may be completed by a Splunk development resource with limited administrative rights.

Make it a habit to consistently request users' input regarding the Splunk-delivered dashboards and reports and what makes them useful. Sit down with day-to-day users and lay out, on a drawing board for example, the business process flows or system diagrams to understand how the underlying processes and systems you're trying to measure really work. Look for key phrases like these, which signify what data is most important to the business:

If this is broken, we lose tons of revenue...
This is a constant point of failure...
We don't know what's going on here...
If only I can see the trend, it will make my work easier...
This is what my boss wants to see...

Splunk dashboard users may come from many areas of the business. You want to talk to all the different users, no matter where they are on the organizational chart. When you make friends with the architects, developers, business analysts, and management, you will end up building dashboards that benefit the organization, not just individuals. With an initial dashboard version, ask for users' thoughts as you observe them using it in their work, and ask what can be improved upon, added, or changed.

We hope that at this point, you realize the importance of dashboards and are ready to get started creating some, as we will do in the following sections.

Dynamic form-based dashboard

In this section, we will create a dynamic form-based dashboard in our Destinations app to allow users to change input values and rerun the dashboard, presenting updated data. Here is a screenshot of the final output of this dynamic form-based dashboard:

Let's begin by creating the dashboard itself and then generate the panels:

Go to the search bar in the Destinations app
Run this search command:
SPL> index=main status_type="*" http_uri="*" server_ip="*" | top status_type, status_description, http_uri, server_ip
Be careful when copying commands with quotation marks. It is best to type in the entire search command to avoid problems.
Go to Save As | Dashboard Panel
Fill in the information based on the following screenshot:
Click on Save
Close the pop-up window that appears (indicating that the dashboard panel was created) by clicking on the X in the top-right corner of the window

Creating a Status Distribution panel

We will go to the dashboard after all the panel searches have been generated. Let's go ahead and create the second panel:

In the search window, type in the following search command:
SPL> index=main status_type="*" http_uri=* server_ip=* | top status_type
You will save this as a dashboard panel in the newly created dashboard. In the Dashboard option, click on the Existing button and look for the new dashboard, as seen here. Don't forget to fill in the Panel Title as Status Distribution:
Click on Save when you are done and again close the pop-up window, signaling the addition of the panel to the dashboard.

Creating the Status Types Over Time panel

Now, we'll move on to create the third panel:

Type in the following search command and be sure to run it so that it is the active search:
SPL> index=main status_type="*" http_uri=* server_ip=* | timechart count by http_status_code
You will save this as a Dynamic Form-based Dashboard panel as well. Type in Status Types Over Time in the Panel Title field:
Click on Save and close the pop-up window, signaling the addition of the panel to the dashboard.

Creating the Hits vs Response Time panel

Now, on to the final panel. Run the following search command:
SPL> index=main status_type="*" http_uri=* server_ip=* | timechart count, avg(http_response_time) as response_time
Save this dashboard panel as Hits vs Response Time:

Arrange the dashboard

We'll move on to look at the dashboard we've created and make a few changes:

Click on the View Dashboard button. If you missed the View Dashboard button, you can find your dashboard by clicking on Dashboards in the main navigation bar.
Let's edit the panel arrangement. Click on the Edit button.
Move the Status Distribution panel to the upper-right row.
Move the Hits vs Response Time panel to the lower-right row.
Click on Save to save your layout changes.

Look at the following screenshot. The dashboard framework you've created should now look much like this. The dashboard probably looks a little plainer than you expected it to. But don't worry; we will improve the dashboard visuals one panel at a time:

Panel options in dashboards

In this section, we will learn how to alter the look of our panels and create visualizations. Go to the edit dashboard mode by clicking on the Edit button. Each dashboard panel will have three setting options to work with: edit search, select visualization, and visualization format options. They are represented by three drop-down icons:

The Edit Search window allows you to modify the search string, change the time modifier for the search, add auto-refresh and progress bar options, as well as convert the panel into a report:

The Select Visualization dropdown allows you to change the type of visualization to use for the panel, as shown in the following screenshot:

Finally, the Visualization Options dropdown will give you the ability to fine-tune your visualization. These options will change depending on the visualization you select. For a normal statistics table, this is how it will look:

Pie chart – Status Distribution

Go ahead and change the Status Distribution visualization panel to a pie chart. You do this by selecting the Select Visualization icon and selecting the Pie icon. Once done, the panel will look like the following screenshot:

Stacked area chart – Status Types Over Time

We will change the view of the Status Types Over Time panel to an area chart. However, by default, area charts will not be stacked. We will update this by adjusting the visualization options:

Change the Status Types Over Time panel to an Area Chart using the same Select Visualization button as in the prior pie chart exercise.
Make the area chart stacked using the Format Visualization icon. In the Stack Mode section, click on Stacked. For Null Values, select Zero. Use the chart that follows for guidance:
Click on Apply. The panel will change right away.
Remove the _time label as it is already implied. You can do this in the X-Axis section by setting the Title to None.
Close the Format Visualization window by clicking on the X in the upper-right corner:

Here is the new stacked area chart panel:

Column with overlay combination chart – Hits vs Response Time

When representing two or more kinds of data with different ranges, using a combination chart (in this case combining a column and a line) can tell a bigger story than one metric and scale alone. We'll use the Hits vs Response Time panel to explore the combination charting options:

In the Hits vs Response Time panel, change the chart panel visualization to Column
In the Visualization Options window, click on Chart Overlay
In the Overlay selection box, select response_time
Turn on View as Axis
Click on X-Axis from the list of options on the left of the window and change the Title to None
Click on Legend from the list of options on the left
Change the Legend Position to Bottom
Click on the X in the upper-right-hand corner to close the Visualization Options window

The new panel will now look similar to the following screenshot. From this and the prior screenshot, you can see there was clearly an outage in the overnight hours:

Click on Done to save all the changes you made and exit the Edit mode

The dashboard has now come to life. This is how it should look now:

To summarize, we saw how to create different types of dashboards. To know more about core Splunk functionalities to transform machine data into powerful insights, check out the book Splunk 7 Essentials, Third Edition.
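For readers less familiar with SPL, the two transforming commands behind the panels above, top and timechart count by, can be pictured with a rough Python equivalent. This is only an approximation written for illustration, not how Splunk computes its results: the helper names, the sample events, and the one-hour bucket size are assumptions, and the field names are borrowed from the searches in this tutorial.

```python
from collections import Counter, defaultdict


def top(events, field, limit=10):
    """Approximate `| top <field>`: most common values with count and percent."""
    values = [event[field] for event in events if field in event]
    total = len(values)
    rows = []
    for value, count in Counter(values).most_common(limit):
        rows.append({field: value, "count": count,
                     "percent": round(100.0 * count / total, 2)})
    return rows


def timechart_count(events, by, span_seconds=3600):
    """Approximate `| timechart count by <field>` with fixed-width time buckets."""
    buckets = defaultdict(Counter)
    for event in events:
        bucket_start = event["_time"] - event["_time"] % span_seconds
        buckets[bucket_start][event[by]] += 1
    return {start: dict(counts) for start, counts in sorted(buckets.items())}


# Made-up events; _time is in seconds for simplicity.
EVENTS = [
    {"_time": 0, "status_type": "Successful", "http_status_code": 200},
    {"_time": 120, "status_type": "Successful", "http_status_code": 200},
    {"_time": 3700, "status_type": "Client Error", "http_status_code": 404},
    {"_time": 3900, "status_type": "Successful", "http_status_code": 200},
]

print(top(EVENTS, "status_type"))
print(timechart_count(EVENTS, "http_status_code"))
```

The first result is the kind of table that feeds the Status Distribution pie chart, and the second is the per-bucket breakdown that the stacked area chart draws.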


Sugandha Lahoti

28 Jul 2018

3 min read

23andMe shares 5mn client genetic data with GSK for drug target discovery, a machine learning application in genetics research


Genetics company 23andMe, which uses machine learning algorithms for human genome analysis, has entered into a four-year collaboration with pharmaceutical giant GlaxoSmithKline (GSK). 23andMe will share genetic data from its 5 million customers with GSK to advance research into treatments for diseases. The collaboration will be used to identify novel drug targets, tackle new subsets of disease, and enable rapid progression of clinical programs. The 12-year-old firm has already published more than 100 scientific papers based on its customers' data.

All activities within the collaboration will initially be co-funded, with either company having certain rights to reduce its funding share. "The goal of the collaboration is to gather insights and discover novel drug targets driving disease progression and develop therapies," GlaxoSmithKline said in a press release. GSK is also reported to have invested $300 million in 23andMe.

During the four-year collaboration, GSK will use 23andMe's database and statistical analytics for drug target discovery. The collaboration will also be used to advance GSK's LRRK2 inhibitor, which is in development as a potential treatment for Parkinson's disease. 23andMe's database of consented customers with a known LRRK2 variant status will be used to accelerate the progress of this programme. Together, GSK and 23andMe will target and recruit patients with defined LRRK2 mutations in order to reach clinical proof of concept.

23andMe has made it quite clear that participating in this program is voluntary and requires clients to affirmatively consent. However, not everyone is clear on how this would work. For one, the company has specified that any research involving customer data that has already been performed or published before a withdrawal request is received will not be reversed. This may work against customers, as people are generally not aware of all the privacy policies and generally don't read the Terms of Service.

Moreover, as Peter Pitts, president of the Center for Medicine in the Public Interest, notes, "If a person's DNA is used in research, that person should be compensated. Customers shouldn't be paying for the privilege of 23andMe working with a for-profit company in a for-profit research project."

Both companies have pledged to provide maximum data protection for their customers. In a blog post, they note, "The continued protection of customers' data and privacy is the highest priority for both GSK and 23andMe. Both companies have stringent security protections in place when it comes to collecting, storing and transferring information about research participants."

You can read more about the news in a blog post by 23andMe co-founder Anne Wojcicki.


Sunith Shetty

27 Jul 2018

9 min read

Setting up Apache Druid in Hadoop for Data visualizations [Tutorial]


Sunith Shetty

27 Jul 2018

15 min read

Use AutoML for building simple to complex machine learning pipelines [Tutorial]


Many moving parts have to be tied together for an ML model to execute and produce results successfully. The mechanism that ties together the different pieces of the ML process is known as a pipeline. A pipeline is a generalized concept, but a very important one for a data scientist. In software engineering, people build pipelines to take software from source code to deployment. Similarly, in ML, a pipeline is created to allow data to flow from its raw format to some useful information. It also provides a mechanism to construct a system of multiple parallel ML pipelines in order to compare the results of several ML methods.

In this tutorial, we see how to create our own AutoML pipelines. You will understand how to build pipelines in order to handle the model building process. Each stage of a pipeline is fed processed data from its preceding stage; that is, the output of a processing unit is supplied as an input to its next step. The data flows through the pipeline just as water flows in a pipe. Mastering the pipeline concept is a powerful way to reduce errors when building ML models, and pipelines form a crucial element for building an AutoML system.

The code files for this article are available on GitHub. This article is an excerpt from the book Hands-On Automated Machine Learning, written by Sibanjan Das and Umit Mert Cakmak.

Getting to know machine learning pipelines

Usually, an ML algorithm needs clean data to detect patterns in the data and make predictions on a new dataset. However, in real-world applications, the data is often not ready to be fed directly into an ML algorithm. Similarly, the output from an ML model is just numbers or characters that need to be processed before they can drive actions in the real world. To accomplish that, the ML model has to be deployed in a production environment. This entire framework of converting raw data to usable information is implemented using an ML pipeline.
The following is a high-level illustration of an ML pipeline:

We will break down the blocks illustrated in the preceding figure as follows:

Data Ingestion: This is the process of obtaining data and importing it for use. Data can be sourced from multiple systems, such as Enterprise Resource Planning (ERP) software, Customer Relationship Management (CRM) software, and web applications. The data extraction can happen in real time or in batches. Acquiring the data is sometimes one of the most challenging steps, as it requires a good understanding of both the business and the data.

Data Preparation: There are several methods to preprocess the data into a form suitable for building models. Real-world data is often skewed, has missing values, and is sometimes noisy. It is, therefore, necessary to clean and transform the data so that it is ready to be run through the ML algorithms.

ML model training: This involves the use of various ML techniques to understand essential features in the data, make predictions, or derive insights from it. Often, the ML algorithms are already coded and available through APIs or programming interfaces. Our most important responsibility is to tune the hyperparameters. Choosing hyperparameters and optimizing them to create a best-fitting model are the most critical and complicated parts of the model training phase.

Model Evaluation: There are various criteria by which a model can be evaluated, usually a combination of statistical methods and business rules. In an AutoML pipeline, the evaluation is mostly based on statistical and mathematical measures. If an AutoML system is developed for a specific business domain or use case, then the business rules can also be embedded into the system to evaluate the correctness of a model.

Retraining: The first model that we create for a use case is often not the best model. It is considered a baseline model, and we try to improve the model's accuracy by training it repeatedly.

Deployment: The final step is to deploy the model, which involves applying and migrating the model to business operations for their use. The deployment stage is highly dependent on the IT infrastructure and software capabilities an organization has.

As we can see, there are several stages to go through to get results out of an ML model. scikit-learn provides a pipeline functionality that can be used to create several complex pipelines. While building an AutoML system, pipelines are going to be very complex, as many different scenarios have to be captured. However, if we know how to preprocess the data, apply an ML algorithm, and use various evaluation metrics, building a pipeline is a matter of giving a shape to those pieces. Let's design a very simple pipeline using scikit-learn.

Simple ML pipeline

We will first import a dataset known as Iris, which is already available in scikit-learn's sample dataset library (http://scikit-learn.org/stable/auto_examples/datasets/plot_iris_dataset.html). The dataset consists of four features and has 150 rows. We will develop the following steps in a pipeline to train our model using the Iris dataset. The problem statement is to predict the species of an Iris flower using four different features:

In this pipeline, we will use MinMaxScaler to scale the input data and logistic regression to predict the species of the Iris. The model will then be evaluated based on the accuracy measure:

The first step is to import various libraries from scikit-learn that will provide methods to accomplish our task. The one import specific to this task is the Pipeline class from sklearn.pipeline.
This will provide us with the methods needed to create an ML pipeline:

    from sklearn.datasets import load_iris
    from sklearn.preprocessing import MinMaxScaler
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split
    from sklearn.pipeline import Pipeline

The next step is to load the Iris data and split it into training and test datasets. In this example, we will use 80% of the dataset to train the model and the remaining 20% to test the accuracy of the model. We can use the shape attribute to view the dimensions of the dataset:

    # Load and split the data
    iris = load_iris()
    X_train, X_test, y_train, y_test = train_test_split(
        iris.data, iris.target, test_size=0.2, random_state=42)
    X_train.shape

The following result shows the training dataset having 4 columns and 120 rows, which equates to 80% of the Iris dataset and is as expected:

Next, we print the dataset to take a glance at the data:

    print(X_train)

The preceding code provides the following output:

The next step is to create a pipeline. The pipeline object is built from a list of (key, value) pairs. The key is a string that names a particular step, and the value is the transformer or estimator object that performs that step.
In the following code snippet, we have named the MinMaxScaler() step minmax and the LogisticRegression(random_state=42) step lr:

    pipe_lr = Pipeline([('minmax', MinMaxScaler()),
                        ('lr', LogisticRegression(random_state=42))])

Then, we fit the pipeline object, pipe_lr, to the training dataset:

    pipe_lr.fit(X_train, y_train)

When we execute the preceding code, we get the following output, which shows the final structure of the fitted model that was built:

The last step is to score the model on the test dataset using the score method:

    score = pipe_lr.score(X_test, y_test)
    print('Logistic Regression pipeline test accuracy: %.3f' % score)

As we can note from the following results, the accuracy of the model was 0.900, which is 90%:

In the preceding example, we created a pipeline that consisted of two steps: min-max scaling and logistic regression. When we executed the fit method on the pipe_lr pipeline, the MinMaxScaler performed fit and transform on the input data, and the result was passed on to the estimator, which is a logistic regression model. The intermediate steps in a pipeline are known as transformers, and the last step is an estimator.

Transformers are used for data preprocessing and have two methods, fit and transform. The fit method is used to find parameters from the training data, and the transform method is used to apply the data preprocessing techniques to the dataset.

Estimators are used for creating machine learning models and have two methods, fit and predict. The fit method is used to train an ML model, and the predict method is used to apply the trained model to a test or new dataset.

This concept is summarized in the following figure:

We only have to call the pipeline's fit method to train a model and its predict method to create predictions. All the other calls, that is, the fit and transform of each intermediate step, are encapsulated in the pipeline and executed as shown in the preceding figure.
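To make the encapsulation described above concrete, here is a minimal sketch that runs the same two steps once through a Pipeline and once by hand, then checks that both give the same predictions. The variable names (pipe, scaler, clf, same_predictions) are chosen for illustration and are not taken from the book's code.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.2, random_state=42)

# Pipeline version: a single fit call drives every step.
pipe = Pipeline([('minmax', MinMaxScaler()),
                 ('lr', LogisticRegression(random_state=42))])
pipe.fit(X_train, y_train)
pipe_pred = pipe.predict(X_test)

# Manual version: the same steps written out explicitly.
scaler = MinMaxScaler()
X_train_scaled = scaler.fit_transform(X_train)  # fit and transform on training data
clf = LogisticRegression(random_state=42)
clf.fit(X_train_scaled, y_train)                # the estimator only sees scaled data
X_test_scaled = scaler.transform(X_test)        # transform only; no refit on test data
manual_pred = clf.predict(X_test_scaled)

same_predictions = np.array_equal(pipe_pred, manual_pred)
print('Pipeline and manual predictions match:', same_predictions)
```

Note that the test data is only transformed, never fitted. The pipeline enforces this automatically, which is one practical reason to prefer it over chaining the calls by hand.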
Sometimes, we will need to write some custom functions to perform custom transformations. The following section is about the function transformer, which can assist us in implementing this custom functionality.

FunctionTransformer

A FunctionTransformer wraps a user-defined function that consumes the data from the pipeline and returns the result of this function to the next stage of the pipeline. It is used for stateless transformations, such as taking the square or log of numbers, defining custom scaling functions, and so on.

In the following example, we will build a pipeline using the CustomLog function and the predefined preprocessing class StandardScaler:

We import all the required libraries as we did in our previous examples. The only addition here is FunctionTransformer from the sklearn.preprocessing library. It is used to execute a custom transformer function and stitch it together with the other stages in a pipeline:

    import numpy as np
    from sklearn.datasets import load_iris
    from sklearn.model_selection import train_test_split
    from sklearn import preprocessing
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import FunctionTransformer
    from sklearn.preprocessing import StandardScaler

In the following code snippet, we define a custom function that returns the natural log of its input X:

    def CustomLog(X):
        return np.log(X)

Next, we will define a data preprocessing function named PreprocData, which accepts the input data (X) and target (Y) of a dataset. For this example, Y is not strictly necessary, as we are not going to build a supervised model and only demonstrate a data preprocessing pipeline. However, in the real world, we can directly use this function to create a supervised ML model. Here, we use the make_pipeline function to create a pipeline. We used the Pipeline class in our earlier example, where we had to define names for the data preprocessing or ML steps.
The advantage of using the make_pipeline function is that it generates the names, or keys, of the steps automatically:

    def PreprocData(X, Y):
        pipe = make_pipeline(
            FunctionTransformer(CustomLog), StandardScaler()
        )
        X_train, X_test, Y_train, Y_test = train_test_split(X, Y)
        pipe.fit(X_train, Y_train)
        return pipe.transform(X_test), Y_test

Now that the pipeline is ready, we can load the Iris dataset. We print the input data X to take a look at the data:

    iris = load_iris()
    X, Y = iris.data, iris.target
    print(X)

The preceding code prints the following output:

Next, we will call the PreprocData function by passing the Iris data. The result returned is a transformed test split, which has been processed first using our CustomLog function and then using the StandardScaler data preprocessing step:

    X_transformed, Y_transformed = PreprocData(X, Y)
    print(X_transformed)

The preceding data transformation task yields the following transformed data results:

We will now need to build various complex pipelines for an AutoML system. In the following section, we will create a sophisticated pipeline using several data preprocessing steps and ML algorithms.

Complex ML pipeline

In this section, we will determine the best classifier to predict the species of an Iris flower using its four different features. We will use a combination of different data preprocessing techniques along with four different ML algorithms for the task.
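Before the larger pipelines are assembled, it may help to see how make_pipeline names its steps and how a FunctionTransformer behaves on a tiny input. The following is a minimal sketch: the helper custom_log and the three-row toy array are assumptions made for illustration, not part of the book's code.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import FunctionTransformer, StandardScaler


def custom_log(X):
    # Hypothetical helper mirroring the article's CustomLog: element-wise natural log.
    return np.log(X)


pipe = make_pipeline(FunctionTransformer(custom_log), StandardScaler())

# make_pipeline derives each step name from the lowercased class name.
step_names = [name for name, _ in pipe.steps]
print(step_names)  # ['functiontransformer', 'standardscaler']

# Toy data (assumed): strictly positive values, so the log is defined everywhere.
X = np.array([[1.0, 10.0],
              [10.0, 100.0],
              [100.0, 1000.0]])
X_out = pipe.fit_transform(X)

# After the log and standard scaling, every column has mean 0 and unit variance.
print(X_out.mean(axis=0), X_out.std(axis=0))
```

Because np.log is undefined for zero and negative inputs, a function such as np.log1p may be a safer choice when the data can contain zeros; that substitution is a suggestion here, not something the original example does.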
The following is the pipeline design for the job:

We will proceed as follows:

We start with importing the various libraries and functions that are required for the task:

    from sklearn.datasets import load_iris
    from sklearn.preprocessing import StandardScaler
    from sklearn.decomposition import PCA
    from sklearn.preprocessing import MinMaxScaler
    from sklearn.model_selection import train_test_split
    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.ensemble import RandomForestClassifier
    from sklearn import svm
    from sklearn import tree
    from sklearn.pipeline import Pipeline

Next, we load the Iris dataset and split it into train and test datasets. X_train and y_train will be used for training the different models, and X_test and y_test will be used for testing the trained models:

    # Load and split the data
    iris = load_iris()
    X_train, X_test, y_train, y_test = train_test_split(
        iris.data, iris.target, test_size=0.2, random_state=42)

Next, we will create four different pipelines, one for each model. In the pipeline for the SVM model, pipe_svm, we will first scale the numeric inputs using StandardScaler and then create the principal components using Principal Component Analysis (PCA). Finally, a Support Vector Machine (SVM) model is built using this preprocessed dataset.

Similarly, we will construct a pipeline named pipe_knn to create the KNN model. Only StandardScaler is used to preprocess the data before executing the KNeighborsClassifier to create the KNN model.

Then, we create a pipeline for building a decision tree model. We use StandardScaler and MinMaxScaler to preprocess the data to be used by the DecisionTreeClassifier.

The last model created using a pipeline is the random forest model, where StandardScaler and PCA are used to preprocess the data to be used by the RandomForestClassifier.
The following is the code snippet for creating these four different pipelines used to create four different models:

    # Construct svm pipeline
    pipe_svm = Pipeline([('ss1', StandardScaler()),
                         ('pca', PCA(n_components=2)),
                         ('svm', svm.SVC(random_state=42))])

    # Construct knn pipeline
    pipe_knn = Pipeline([('ss2', StandardScaler()),
                         ('knn', KNeighborsClassifier(n_neighbors=6, metric='euclidean'))])

    # Construct DT pipeline
    pipe_dt = Pipeline([('ss3', StandardScaler()),
                        ('minmax', MinMaxScaler()),
                        ('dt', tree.DecisionTreeClassifier(random_state=42))])

    # Construct Random Forest pipeline
    num_trees = 100
    max_features = 1
    pipe_rf = Pipeline([('ss4', StandardScaler()),
                        ('pca', PCA(n_components=2)),
                        ('rf', RandomForestClassifier(n_estimators=num_trees,
                                                      max_features=max_features))])

Next, we will need to store the names of the pipelines in a dictionary, which will be used to display results:

    pipe_dic = {0: 'K Nearest Neighbours', 1: 'Decision Tree',
                2: 'Random Forest', 3: 'Support Vector Machines'}

Then, we will list the four pipelines so that we can execute them iteratively:

    pipelines = [pipe_knn, pipe_dt, pipe_rf, pipe_svm]

Now, we are ready with the complex structure of the whole pipeline. The only things that remain are to fit the data to the pipelines, evaluate the results, and select the best model.

In the following code snippet, we fit each of the four pipelines iteratively to the training dataset:

    # Fit the pipelines
    for pipe in pipelines:
        pipe.fit(X_train, y_train)

Once the model fitting is executed successfully, we will examine the accuracy of the four models using the following code snippet:

    # Compare accuracies
    for idx, val in enumerate(pipelines):
        print('%s pipeline test accuracy: %.3f' % (pipe_dic[idx], val.score(X_test, y_test)))

We can note from the following results that the k-nearest neighbors and decision tree models lead the pack with a perfect accuracy of 100%.
This is too good to believe and might be a result of using a small dataset and/or overfitting:

We can use either of the two winning models, k-nearest neighbors (KNN) or the decision tree, for deployment. We can select one using the following code snippet:

    best_accuracy = 0
    best_classifier = 0
    best_pipeline = ''
    for idx, val in enumerate(pipelines):
        # Score each fitted pipeline once on the test set
        score = val.score(X_test, y_test)
        if score > best_accuracy:
            best_accuracy = score
            best_pipeline = val
            best_classifier = idx
    print('%s Classifier has the best accuracy of %.2f' % (pipe_dic[best_classifier], best_accuracy))

As the accuracies were the same for k-nearest neighbors and the decision tree, KNN was chosen as the best model: it comes first in the pipelines list, and the strict greater-than comparison means a later model with an equal score does not replace it. However, at this stage, we can also use some business rules or assess the execution cost to decide the best model:

To summarize, we learned about building pipelines for ML systems. The concepts described in this article give you a foundation for creating pipelines. To get a clearer understanding of the different aspects of Automated Machine Learning, and of how to incorporate automation tasks using practical datasets, do check out the book Hands-On Automated Machine Learning.
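The earlier caveat, that a perfect score may simply reflect a small test set, can be probed with cross-validation. The sketch below is one possible check and is not part of the book's code: it rebuilds the KNN and decision tree pipelines with the same settings as the tutorial and scores them with 5-fold stratified cross-validation. The fold count, the shuffle seed, and the names cv and cv_scores are assumptions.

```python
from sklearn import tree
from sklearn.datasets import load_iris
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler, StandardScaler

iris = load_iris()

# The two pipelines that tied on the single train/test split.
pipelines = {
    'K Nearest Neighbours': Pipeline([
        ('ss', StandardScaler()),
        ('knn', KNeighborsClassifier(n_neighbors=6, metric='euclidean'))]),
    'Decision Tree': Pipeline([
        ('ss', StandardScaler()),
        ('minmax', MinMaxScaler()),
        ('dt', tree.DecisionTreeClassifier(random_state=42))]),
}

# Assumed setup: 5 stratified folds, shuffled with a fixed seed for repeatability.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

cv_scores = {}
for name, pipe in pipelines.items():
    # The whole pipeline is refit inside each fold, so the scalers never see held-out rows.
    scores = cross_val_score(pipe, iris.data, iris.target, cv=cv)
    cv_scores[name] = scores
    print('%s: mean accuracy %.3f (std %.3f)' % (name, scores.mean(), scores.std()))
```

Passing the pipeline itself to cross_val_score, rather than pre-scaled data, keeps the preprocessing inside each fold. A mean accuracy across folds is usually a steadier basis for choosing between models than a single 30-row test split.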
