How-To Tutorials


Build intelligent interfaces with CoreML using a CNN [Tutorial]

Savia Lobo
03 Sep 2018
19 min read
Core ML gives devices the potential to better serve us, rather than us serving them. This adheres to a rule stated by developer Eric Raymond: a computer should never ask the user for any information that it can auto-detect, copy, or deduce. This article is an excerpt taken from Machine Learning with Core ML, written by Joshua Newnham. In today's post, we will implement an application that attempts to guess what the user is trying to draw and offers pre-drawn images (found via an image search) that the user can substitute for their sketch. We will explore two techniques: first, using a convolutional neural network (CNN), which we are becoming familiar with, to make the prediction, and then applying a context-based similarity sorting strategy to better align the suggestions with what the user is trying to sketch.

Reviewing the training data and model

We will be using a slightly smaller set, with 205 out of the 250 categories; the exact categories can be found in the CSV file /Chapter7/Training/sketch_classes.csv, along with the Jupyter Notebooks used to prepare the data and train the model. The original sketches are available in SVG and PNG formats. Because we're using a CNN, rasterized images (PNG) were used, but rescaled from 1111 x 1111 to 256 x 256; this is the expected input of our model. The data was then split into a training and a validation set, using 80% (64 samples from each category) for training and 20% (17 samples from each category) for validation. After 68 iterations (epochs), the model was able to achieve an accuracy of approximately 65% on the validation data. Not exceptional, but if we consider the top two or three predictions, then this accuracy increases to nearly 90%. The following diagram shows the plots comparing training and validation accuracy, and loss, during training.

With our model trained, our next step is to export it using the Core ML Tools made available by Apple (as discussed in previous chapters) and import it into our project.

Classifying sketches

Here we will walk through importing the Core ML model into our project and hooking it up, including using the model to perform inference on the user's sketch and searching for and suggesting substitute images that the user can swap their sketch with. Let's get started with importing the Core ML model into our project.

Locate the model in the project repository's folder /CoreMLModels/Chapter7/cnnsketchclassifier.mlmodel; with the model selected, drag it into your Xcode project, leaving the defaults for the import options. Once imported, select the model to inspect the details, which should look similar to the following screenshot.

As with all our models, we verify that the model is included in the target by checking that the appropriate Target Membership is ticked, and then we turn our attention to the inputs and outputs, which should be familiar by now. We can see that our model expects a single-channel (grayscale) 256 x 256 image and returns the dominant class via the classLabel property of the output, along with a dictionary of probabilities for all classes via the classLabelProbs property.

With our model now imported, let's discuss the details of how we will integrate it into our project. Recall that our SketchView emits the events UIControlEvents.editingDidStart, UIControlEvents.editingChanged, and UIControlEvents.editingDidEnd as the user draws.
If you inspect the SketchViewController, you will see that we have already registered to listen for the UIControlEvents.editingDidEnd event, as shown in the following code snippet:

override func viewDidLoad() {
    super.viewDidLoad()
    ...
    ...
    self.sketchView.addTarget(self,
        action: #selector(SketchViewController.onSketchViewEditingDidEnd),
        for: .editingDidEnd)

    queryFacade.delegate = self
}

Each time the user ends a stroke, we will start the process of trying to guess what the user is sketching and search for suitable substitutes. This functionality is triggered via the .editingDidEnd action method onSketchViewEditingDidEnd, but is delegated to the QueryFacade class, which is responsible for implementing it. This is where we will spend the majority of our time in this section and the next. It's also worth highlighting the statement queryFacade.delegate = self in the previous code snippet: QueryFacade performs most of its work off the main thread and notifies this delegate of the status and results once finished, which we will get to in a short while.

Let's start by implementing the functionality of the onSketchViewEditingDidEnd method, before turning our attention to the QueryFacade class. Within the SketchViewController class, navigate to the onSketchViewEditingDidEnd method and append the following code:

guard self.sketchView.currentSketch != nil,
    let sketch = self.sketchView.currentSketch as? StrokeSketch else {
    return
}
queryFacade.asyncQuery(sketch: sketch)

Here, we get the current sketch, returning early if no sketch is available or if it's not a StrokeSketch; otherwise, we hand it over to our queryFacade (an instance of the QueryFacade class).

Let's now turn our attention to the QueryFacade class; select the QueryFacade.swift file from the left-hand panel within Xcode to bring it up in the editor area. A lot of plumbing has already been implemented to allow us to focus on the core functionality of predicting, searching, and sorting. Let's quickly discuss some of the details, starting with the properties:

let context = CIContext()
let queryQueue = DispatchQueue(label: "query_queue")

var targetSize = CGSize(width: 256, height: 256)

weak var delegate : QueryDelegate?

var currentSketch : Sketch?{
    didSet{
        self.newQueryWaiting = true
        self.queryCanceled = false
    }
}

fileprivate var queryCanceled : Bool = false
fileprivate var newQueryWaiting : Bool = false
fileprivate var processingQuery : Bool = false

var isProcessingQuery : Bool{
    get{
        return self.processingQuery
    }
}

var isInterrupted : Bool{
    get{
        return self.queryCanceled || self.newQueryWaiting
    }
}

QueryFacade is only concerned with the most current sketch. Therefore, each time a new sketch is assigned via the currentSketch property, newQueryWaiting is set to true (and queryCanceled is reset to false). During each task (such as performing prediction, searching, and downloading), we check the isInterrupted property, and if it is true, we exit early and proceed to process the latest sketch.
When you pass the sketch to the asyncQuery method, it is assigned to the currentSketch property, and asyncQuery then calls queryCurrentSketch to do the bulk of the work, unless a query is currently being processed:

func asyncQuery(sketch:Sketch){
    self.currentSketch = sketch

    if !self.processingQuery{
        self.queryCurrentSketch()
    }
}

fileprivate func processNextQuery(){
    self.queryCanceled = false

    if self.newQueryWaiting && !self.processingQuery{
        self.queryCurrentSketch()
    }
}

fileprivate func queryCurrentSketch(){
    guard let sketch = self.currentSketch else{
        self.processingQuery = false
        self.newQueryWaiting = false
        return
    }

    self.processingQuery = true
    self.newQueryWaiting = false

    queryQueue.async {

        DispatchQueue.main.async{
            self.processingQuery = false
            self.delegate?.onQueryCompleted(
                status:self.isInterrupted ? -1 : -1,
                result:nil)
            self.processNextQuery()
        }
    }
}

Let's work bottom-up, implementing all the supporting methods before we tie everything together within the queryCurrentSketch method. Start by declaring an instance of our model; add the following variable near the top of the QueryFacade class:

let sketchClassifier = cnnsketchclassifier()

Now, with our model instantiated and ready, we will navigate to the classifySketch method of the QueryFacade class; it is here that we will make use of our imported model to perform inference, but let's first review what already exists:

func classifySketch(sketch:Sketch) -> [(key:String,value:Double)]?{
    if let img = sketch.exportSketch(size: nil)?
        .resize(size: self.targetSize).rescalePixels(){
        return self.classifySketch(image: img)
    }
    return nil
}

func classifySketch(image:CIImage) -> [(key:String,value:Double)]?{
    return nil
}

Here, we see that classifySketch is overloaded, with one method accepting a Sketch and the other a CIImage. The former, when called, obtains the rasterized version of the sketch using the exportSketch method. If successful, it resizes the rasterized image using the targetSize property and then rescales the pixels before passing the prepared CIImage along to the alternative classifySketch method.

Pixel values are in the range of 0-255 (per channel; in this case, it's just a single channel). Typically, you try to avoid having large numbers in your network, because they make it more difficult for your model to learn (converge); this is somewhat analogous to trying to drive a car whose steering wheel can only be turned hard left or hard right. These extremes would cause a lot of over-steering and make navigating anywhere extremely difficult.

The second classifySketch method is responsible for performing the actual inference. Add the following code within the classifySketch(image:CIImage) method:

if let pixelBuffer = image.toPixelBuffer(context: self.context, gray: true){
    let prediction = try? self.sketchClassifier.prediction(image: pixelBuffer)

    if let classPredictions = prediction?.classLabelProbs{
        let sortedClassPredictions = classPredictions.sorted(by: { (kvp1, kvp2) -> Bool in
            kvp1.value > kvp2.value
        })

        return sortedClassPredictions
    }
}

return nil

Here, we use the image's toPixelBuffer method, an extension we added to the CIImage class, to obtain a grayscale CVPixelBuffer representation of the image. Then, with a reference to the buffer, we pass it to the prediction method of our model instance, sketchClassifier, to obtain the probabilities for each label. Finally, we sort these probabilities from most likely to least likely before returning the sorted results to the caller.
Now, with some inkling as to what the user is trying to sketch, we will proceed to search and download the ones we are most confident about. The task of searching and downloading will be the responsibility of the downloadImages method within the QueryFacade class. This method will make use of an existing BingService that exposes methods for searching and downloading images. Let's hook this up now; jump into the downloadImages method and append the following highlighted code to its body: func downloadImages(searchTerms:[String], searchTermsCount:Int=4, searchResultsCount:Int=2) -> [CIImage]?{ var bingResults = [BingServiceResult]() for i in 0..<min(searchTermsCount, searchTerms.count){ let results = BingService.sharedInstance.syncSearch( searchTerm: searchTerms[i], count:searchResultsCount) for bingResult in results{ bingResults.append(bingResult) } if self.isInterrupted{ return nil } } } The downloadImages method takes the arguments searchTerms, searchTermsCount, and searchResultsCount. The searchTerms is a sorted list of labels returned by our classifySketch method, from which the searchTermsCount determines how many of these search terms we use (defaulting to 4). Finally, searchResultsCount limits the results returned for each search term. The preceding code performs a sequential search using the search terms passed into the method. And as mentioned previously, here we are using Microsoft's Bing Image Search API, which requires registration, something we will return to shortly. After each search, we check the property isInterrupted to see whether we need to exit early; otherwise, we continue on to the next search. The result returned by the search includes a URL referencing an image; we will use this next to download the image with each of the results, before returning an array of CIImage to the caller. Let's add this now. Append the following code to the downloadImages method: var images = [CIImage]() for bingResult in bingResults{ if let image = BingService.sharedInstance.syncDownloadImage( bingResult: bingResult){ images.append(image) } if self.isInterrupted{ return nil } } return images As before, the process is synchronous and after each download, we check the isInterrupted property to see if we need to exit early, otherwise returning the list of downloaded images to the caller. So far, we have implemented the functionality to support prediction, searching, and downloading; our next task is to hook all of this up. Head back to the queryCurrentSketch method and add the following code within the queryQueue.async block. Ensure that you replace the DispatchQueue.main.async block: queryQueue.async { guard let predictions = self.classifySketch( sketch: sketch) else{ DispatchQueue.main.async{ self.processingQuery = false self.delegate?.onQueryCompleted( status:-1, result:nil) self.processNextQuery() } return } let searchTerms = predictions.map({ (key, value) -> String in return key }) guard let images = self.downloadImages( searchTerms: searchTerms, searchTermsCount: 4) else{ DispatchQueue.main.async{ self.processingQuery = false self.delegate?.onQueryCompleted( status:-1, result:nil) self.processNextQuery() } return } guard let sortedImage = self.sortByVisualSimilarity( images: images, sketch: sketch) else{ DispatchQueue.main.async{ self.processingQuery = false self.delegate?.onQueryCompleted( status:-1, result:nil) self.processNextQuery() } return } DispatchQueue.main.async{ self.processingQuery = false self.delegate?.onQueryCompleted( status:self.isInterrupted ? 
-1 : 1, result:QueryResult( predictions: predictions, images: sortedImage)) self.processNextQuery() } } It's a large block of code but nothing complicated; let's quickly walk our way through it. We start by calling the classifySketch method we just implemented. As you may recall, this method returns a sorted list of label and probability peers unless interrupted, in which case nil will be returned. We should handle this by notifying the delegate before exiting the method early (a check we apply to all of our tasks). Once we've obtained the list of sorted labels, we pass them to the downloadImages method to receive the associated images, which we then pass to the sortByVisualSimilarity method. This method currently returns just the list of images, but it's something we will get back to in the next section. Finally, the method passes the status and sorted images wrapped in a QueryResult instance to the delegate via the main thread, before checking whether it needs to process a new sketch (by calling the processNextQuery method). At this stage, we have implemented all the functionality required to download our substitute images based on our guess as to what the user is currently sketching. Now, we just need to jump into the SketchViewController class to hook this up, but before doing so, we need to obtain a subscription key to use Bing's Image Search. Within your browser, head to https://azure.microsoft.com/en-gb/services/cognitive-services/bing-image-search-api/ and click on the Try Bing Image Search API, as shown in the following screenshot: After clicking on Try Bing Image Search API, you will be presented with a series of dialogs; read, and once (if) agreed, sign in or register. Continue following the screens until you reach a page informing you that the Bing Search API has been successfully added to your subscription, as shown in the following screenshot: On this page, scroll down until you come across the entry Bing Search APIs v7. If you inspect this block, you should see a list of Endpoints and Keys. Copy and paste one of these keys within the BingService.swift file, replacing the value of the constant subscriptionKey; the following screenshot shows the web page containing the service key: Return to the SketchViewController by selecting the SketchViewController.swift file from the left-hand panel, and locate the method onQueryCompleted: func onQueryCompleted(status: Int, result:QueryResult?){ } Recall that this is a method signature defined in the QueryDelegate protocol, which the QueryFacade uses to notify the delegate if the query fails or completes. It is here that we will present the matching images we have found through the process we just implemented. We do this by first checking the status. If deemed successful (greater than zero), we remove every item that is referenced in the queryImages array, which is the data source for our UICollectionView used to present the suggested images to the user. Once emptied, we iterate through all the images referenced within the QueryResult instance, adding them to the queryImages array before requesting the UICollectionView to reload the data. 
Add the following code to the body of the onQueryCompleted method: guard status > 0 else{ return } queryImages.removeAll() if let result = result{ for cimage in result.images{ if let cgImage = self.ciContext.createCGImage(cimage, from:cimage.extent){ queryImages.append(UIImage(cgImage:cgImage)) } } } toolBarLabel.isHidden = queryImages.count == 0 collectionView.reloadData() There we have it; everything is in place to handle guessing of what the user draws and present possible suggestions. Now is a good time to build and run the application on either the simulator or the device to check whether everything is working correctly. If so, then you should see something similar to the following: There is one more thing left to do before finishing off this section. Remembering that our goal is to assist the user to quickly sketch out a scene or something similar, our hypothesis is that guessing what the user is drawing and suggesting ready-drawn images will help them achieve their task. So far, we have performed prediction and provided suggestions to the user, but currently the user is unable to replace their sketch with any of the presented suggestions. Let's address this now. Our SketchView currently only renders StrokeSketch (which encapsulates the metadata of the user's drawing). Because our suggestions are rasterized images, our choice is to either extend this class (to render strokes and rasterized images) or create a new concrete implementation of the Sketch protocol. In this example, we will opt for the latter and implement a new type of Sketch capable of rendering a rasterized image. Select the Sketch.swift file to bring it to focus in the editor area of Xcode, scroll to the bottom, and add the following code: class ImageSketch : Sketch{ var image : UIImage! var size : CGSize! var origin : CGPoint! var label : String! init(image:UIImage, origin:CGPoint, size:CGSize, label: String) { self.image = image self.size = size self.label = label self.origin = origin } } We have defined a simple class that is referencing an image, origin, size, and label. The origin determines the top-left position where the image should be rendered, while the size determines its, well, size! To satisfy the Sketch protocol, we must implement the properties center and boundingBox along with the methods draw and exportSketch. Let's implement each of these in turn, starting with boundingBox. The boundingBox property is a computed property derived from the properties origin and size. Add the following code to your ImageSketch class: var boundingBox : CGRect{ get{ return CGRect(origin: self.origin, size: self.size) } } Similarly, center will be another computed property derived from the origin and size properties, simply translating the origin with respect to the size. Add the following code to your ImageSketch class: var center : CGPoint{ get{ let bbox = self.boundingBox return CGPoint(x:bbox.origin.x + bbox.size.width/2, y:bbox.origin.y + bbox.size.height/2) } set{ self.origin = CGPoint(x:newValue.x - self.size.width/2, y:newValue.y - self.size.height/2) } } The draw method will simply use the passed-in context to render the assigned image within the boundingBox; append the following code to your ImageSketch class: func draw(context:CGContext){ self.image.draw(in: self.boundingBox) }   Our last method, exportSketch, is also fairly straightforward. Here, we create an instance of CIImage, passing in the image (of type UIImage). 
Then, we resize it using the extension method we implemented back in Chapter 3, Recognizing Objects in the World. Add the following code to finish off the ImageSketch class: func exportSketch(size:CGSize?) -> CIImage?{ guard let ciImage = CIImage(image: self.image) else{ return nil } if self.image.size.width == self.size.width && self.image.size.height == self.size.height{ return ciImage } else{ return ciImage.resize(size: self.size) } } We now have an implementation of Sketch that can handle rendering of rasterized images (like those returned from our search). Our final task is to swap the user's sketch with an item the user selects from the UICollectionView. Return to SketchViewController class by selecting the SketchViewController.swift from the left-hand-side panel in Xcode to bring it up in the editor area. Once loaded, navigate to the method collectionView(_ collectionView:, didSelectItemAt:); this should look familiar to most of you. It is the delegate method for handling cells selected from a UICollectionView and it's where we will handle swapping of the user's current sketch with the selected item. Let's start by obtaining the current sketch and associated image that was selected. Add the following code to the body of the collectionView(_collectionView:,didSelectItemAt:) method: guard let sketch = self.sketchView.currentSketch else{ return } self.queryFacade.cancel() let image = self.queryImages[indexPath.row]   Now, with reference to the current sketch and image, we want to try and keep the size relatively the same as the user's sketch. We will do this by simply obtaining the sketch's bounding box and scaling the dimensions to respect the aspect ratio of the selected image. Add the following code, which handles this: var origin = CGPoint(x:0, y:0) var size = CGSize(width:0, height:0) if bbox.size.width > bbox.size.height{ let ratio = image.size.height / image.size.width size.width = bbox.size.width size.height = bbox.size.width * ratio } else{ let ratio = image.size.width / image.size.height size.width = bbox.size.height * ratio size.height = bbox.size.height } Next, we obtain the origin (top left of the image) by obtaining the center of the sketch and offsetting it relative to its width and height. Do this by appending the following code: origin.x = sketch.center.x - size.width / 2 origin.y = sketch.center.y - size.height / 2 We can now use the image, size, and origin to create an ImageSketch, and replace it with the current sketch simply by assigning it to the currentSketch property of the SketchView instance. Add the following code to do just that: self.sketchView.currentSketch = ImageSketch(image:image, origin:origin, size:size, label:"") Finally, some housekeeping; we'll clear the UICollectionView by removing all images from the queryImages array (its data source) and request it to reload itself. Add the following block to complete the collectionView(_ collectionView:,didSelectItemAt:) method: self.queryImages.removeAll() self.toolBarLabel.isHidden = queryImages.count == 0 self.collectionView.reloadData() Now is a good time to build and run to ensure that everything is working as planned. If so then, you should be able to swap out your sketch with one of the suggestions presented at the top, as shown in the following screenshot: We learned how to build Intelligent interfaces using Core ML. 
If you've enjoyed reading this post, do check out Machine Learning with Core ML to further implement Core ML for visual-based applications using the principles of transfer learning and neural networks.

Introducing Intelligent Apps
5 examples of Artificial Intelligence in Web apps
Voice, natural language, and conversations: Are they the next web UI?


Getting started with Amazon Machine Learning workflow [Tutorial]

Melisha Dsouza
02 Sep 2018
14 min read
Amazon Machine Learning is useful for building ML models and generating predictions. It also enables the development of robust and scalable smart applications. The process of building ML models with Amazon Machine Learning consists of three operations: data analysis, model training, and evaluation. The code files for this article are available on GitHub. This tutorial is an excerpt from a book written by Alexis Perrier titled Effective Amazon Machine Learning. The Amazon Machine Learning service is available at https://console.aws.amazon.com/machinelearning/.

The Amazon ML workflow closely follows a standard data science workflow, with these steps:

1. Extract the data and clean it up. Make it available to the algorithm.
2. Split the data into a training and validation set, typically a 70/30 split with equal distribution of the predictors in each part.
3. Select the best model by training several models on the training dataset and comparing their performances on the validation dataset.
4. Use the best model for predictions on new data.

As shown in the following Amazon ML menu, the service is built around four objects: Datasource, ML model, Evaluation, and Prediction. The Datasource and ML model can also be configured and set up in the same flow by creating a new Datasource and ML model. Let us take a closer look at each one of these steps.

Understanding the dataset used

We will use the simple Predicting Weight by Height and Age dataset (from Lewis and Taylor, 1967) with 237 samples of children's age, weight, height, and gender, which is available at https://v8doc.sas.com/sashtml/stat/chap55/sect51.htm. This dataset is composed of 237 rows. Each row has the following predictors: sex (F, M), age (in months), and height (in inches); we are trying to predict the weight (in lbs) of these children. There are no missing values and no outliers. The variables are close enough in range, and normalization is not required. We do not need to carry out any preprocessing or cleaning on the original dataset. Age, height, and weight are numerical variables (real-valued), and sex is a categorical variable.

We will randomly select 20% of the rows as the held-out subset to use for prediction on previously unseen data and keep the other 80% as training and evaluation data. This data split can be done in Excel or any other spreadsheet editor:

1. Create a new column with randomly generated numbers.
2. Sort the spreadsheet by that column.
3. Select 190 rows for training and 47 rows for prediction (roughly an 80/20 split).

Let us name the training set LT67_training.csv and the held-out set that we will use for prediction LT67_heldout.csv, where LT67 stands for Lewis and Taylor, the creators of this dataset in 1967. As with all datasets, scripts, and resources mentioned in this book, the training and holdout files are available in the GitHub repository at https://github.com/alexperrier/packt-aml. It is important for the distribution in age, sex, height, and weight to be similar in both subsets. We want the data on which we will make predictions to show patterns that are similar to the data on which we will train and optimize our model.

Loading the data on S3

Follow these steps to load the training and held-out datasets on S3:

1. Go to your S3 console at https://console.aws.amazon.com/s3.
2. Create a bucket if you haven't done so already. Buckets are basically folders that are uniquely named across all of S3. We created a bucket named aml.packt.
Since that name has now been taken, you will have to choose another bucket name if you are following along with this demonstration.

3. Click on the bucket name you created and upload both the LT67_training.csv and LT67_heldout.csv files by selecting Upload from the Actions drop-down menu.

Both files are small, only a few KB, and hosting costs should remain negligible for this exercise. Note that for each file, by selecting the Properties tab on the right, you can specify how your files are accessed: which user, role, group, or AWS service may download, read, write, and delete the files, and whether or not they should be accessible from the open web. When creating the datasource in Amazon ML, you will be prompted to grant Amazon ML access to your input data. You can specify the access rules to these files now in S3 or simply grant access later on.

Our data is now in the cloud, in an S3 bucket. We need to tell Amazon ML where to find that input data by creating a datasource. We will first create the datasource for the training file LT67_training.csv.

Declaring a datasource

Go to the Amazon ML dashboard, and click on Create new... | Datasource and ML model. We will use the faster flow available by default. As shown in the following screenshot, you are asked to specify the path to the LT67_training.csv file ({S3://bucket}{path}{file}). Note that the S3 location field automatically populates with the bucket names and file names that are available to your user. Specifying a Datasource name is useful for organizing your Amazon ML assets.

By clicking on Verify, Amazon ML will make sure that it has the proper rights to access the file. If it needs to be granted access to the file, you will be prompted to do so, as shown in the following screenshot; just click on Yes to grant access. At this point, Amazon ML will validate the datasource and analyze its contents.

Creating the datasource

An Amazon ML datasource is composed of the following:

The location of the data file: the data file is not duplicated or cloned in Amazon ML but accessed from S3.
The schema, which contains information on the type of the variables contained in the CSV file: Categorical, Text, Numeric (real-valued), or Binary.

It is possible to supply Amazon ML with your own schema or modify the one created by Amazon ML. At this point, Amazon ML has a pretty good idea of the type of data in your training dataset. It has identified the different types of variables and knows how many rows it has.

Move on to the next step by clicking on Continue, and see what schema Amazon ML has inferred from the dataset, as shown in the next screenshot. Amazon ML needs to know at this point which variable you are trying to predict. Be sure to tell Amazon ML the following:

The first line in the CSV file contains the column names.
The target is the weight.

We see here that Amazon ML has correctly inferred that sex is categorical and that age, height, and weight are numeric (continuous real values). Since we chose a numeric variable as the target, Amazon ML will use Linear Regression as the predictive model. For binary or categorical targets, it would have used Logistic Regression. This means that Amazon ML will try to find the best a, b, and c coefficients so that the weight predicted by the following equation is as close as possible to the observed real weight present in the data:

predicted weight = a * age + b * height + c * sex

Amazon ML will then ask you if your data contains a row identifier. In our present case, it does not.
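To make the equation above concrete, here is a minimal JavaScript sketch of the kind of model Amazon ML fits for this dataset. It is purely illustrative: the coefficients are made-up placeholders, not values learned by Amazon ML, and sex is encoded as 0/1, as described in the recipe step later on.

```javascript
// Illustrative only: a linear model of the form used by Amazon ML for this dataset.
// The coefficients a, b, c below are hypothetical placeholders, not learned values.
function predictWeightLbs(ageMonths, heightInches, sexBinary /* m = 0, f = 1 */) {
  const a = 0.2;  // lbs per month of age (made up)
  const b = 1.2;  // lbs per inch of height (made up)
  const c = -3.0; // adjustment for sex (made up)
  return a * ageMonths + b * heightInches + c * sexBinary;
}

// A 150-month-old girl who is 60 inches tall, with these placeholder coefficients:
console.log(predictWeightLbs(150, 60, 1)); // 30 + 72 - 3 = 99 (illustrative only)
```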
Row identifiers are useful when you want to understand the prediction obtained for each row or add an extra column to your dataset later on in your project. Row identifiers are for reference purposes only and are not used by the service to build the model. You will be asked to review the datasource. You can go back to each of the previous steps and edit the parameters for the schema, the target, and the input data. Now that the data is known to Amazon ML, the next step is to set up the parameters of the algorithm that will train the model.

Understanding the model

We select the default parameters for the training and evaluation settings. Amazon ML will do the following:

Create a recipe for data transformation based on the statistical properties it has inferred from the dataset.
Split the dataset (LT67_training.csv) into a training part and a validation part, with a 70/30 split. The split strategy assumes the data has already been shuffled and can be split sequentially.

The recipe will be used to transform the data in a similar way for the training and the validation datasets. The only transformation suggested by Amazon ML is to transform the categorical variable sex into a binary variable, where m = 0 and f = 1, for instance. No other transformation is needed.

The default advanced settings for the model are shown in the following screenshot. We see that Amazon ML will pass over the data 10 times, shuffling the data each time. It will use an L2 regularization strategy, based on the sum of the squares of the coefficients of the regression, to prevent overfitting. We will evaluate the predictive power of the model using our LT67_heldout.csv dataset later on.

Regularization comes in three levels, with a mild (10^-6), medium (10^-4), or aggressive (10^-2) setting, each value stronger than the previous one. The default setting is mild, the lowest, with a regularization constant of 0.000001 (10^-6), implying that Amazon ML does not anticipate much overfitting on this dataset. This makes sense when the number of predictors, three in our case, is much smaller than the number of samples (190 for the training set).

Clicking on the Create ML model button will launch the model creation. This takes a few minutes to resolve, depending on the size and complexity of your dataset. You can check its status by refreshing the model page; in the meantime, the model status remains pending. At that point, Amazon ML will split our training dataset into two subsets: a training and a validation set. It will use the training portion of the data to train several settings of the algorithm and select the best one based on its performance on the training data. It will then apply the associated model to the validation set and return an evaluation score for that model.

By default, Amazon ML will sequentially take the first 70% of the samples for training and the remaining 30% for validation. It's worth noting that Amazon ML will not create two extra files and store them on S3, but will instead create two new datasources out of the initial datasource we previously defined. Each new datasource is obtained from the original one via a data rearrangement JSON recipe such as the following:

{
    "splitting": {
        "percentBegin": 0,
        "percentEnd": 70
    }
}

You can see these two new datasources in the Datasource dashboard.
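The three regularization levels described above simply scale an L2 penalty added to the training objective. The following sketch shows the idea in JavaScript; it is a toy illustration of the math, not Amazon ML's internal code.

```javascript
// Toy illustration of an L2-regularized squared-error objective (not Amazon ML's code).
const L2 = { mild: 1e-6, medium: 1e-4, aggressive: 1e-2 };

function regularizedLoss(predictions, targets, weights, lambda) {
  const n = predictions.length;
  const meanSquaredError =
    predictions.reduce((sum, p, i) => sum + (p - targets[i]) ** 2, 0) / n;
  // The penalty grows with the sum of the squared regression coefficients,
  // so larger lambda values push the model toward smaller weights.
  const l2Penalty = lambda * weights.reduce((sum, w) => sum + w * w, 0);
  return meanSquaredError + l2Penalty;
}

// With the mild (default) setting, the penalty barely moves the loss for small weights:
console.log(regularizedLoss([60, 75], [58, 80], [0.2, 1.2, -3.0], L2.mild));
console.log(regularizedLoss([60, 75], [58, 80], [0.2, 1.2, -3.0], L2.aggressive));
```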
Three datasources are now available where there was initially only one, as shown in the following screenshot. While the model is being trained, Amazon ML runs the Stochastic Gradient Descent algorithm several times on the training data with different parameters:

Varying the learning rate in increments of powers of 10: 0.01, 0.1, 1, 10, and 100.
Making several passes over the training data, shuffling the samples before each pass.
At each pass, calculating the prediction error, the Root Mean Squared Error (RMSE), to estimate how much of an improvement was obtained over the last pass. If the decrease in RMSE is not significant, the algorithm is considered to have converged, and no further pass will be made.

At the end of the passes, the setting that ends up with the lowest RMSE wins, and the associated model (the weights of the regression) is selected as the best version. Once the model has finished training, Amazon ML evaluates its performance on the validation datasource. Once the evaluation is ready, you have access to the model's evaluation.

Evaluating the model

Amazon ML uses the standard RMSE metric for linear regression. RMSE is the square root of the mean of the squared differences between the predicted values and the real values:

RMSE = sqrt( (1/n) * sum( (ŷ_i - y_i)^2 ) )

Here, ŷ is the predicted value and y the real value we want to predict (the weight of the children in our case). The closer the predictions are to the real values, the lower the RMSE is. A lower RMSE means a better, more accurate prediction.

Making batch predictions

We now have a model that has been properly trained and selected among other models. We can use it to make predictions on new data. A batch prediction consists of applying a model to a datasource in order to make predictions on that datasource. We need to tell Amazon ML which model we want to apply to which data. Batch predictions are different from streaming predictions: with batch predictions, all the data is already available as a datasource, while with streaming predictions, the data is fed to the model as it becomes available, and the dataset is not available beforehand in its entirety.

In the main menu, select Batch Predictions to access the predictions dashboard and click on Create a New Prediction. The first step is to select one of the models available in your model dashboard; you should choose the one that has the lowest RMSE. The next step is to associate a datasource with the model you just selected. We uploaded the held-out dataset to S3 at the beginning of this chapter (under the Loading the data on S3 section) but did not use it to create a datasource. We will do so now. When asked for a datasource in the next screen, make sure to check My data is in S3, and I need to create a datasource, and then select the held-out dataset that should already be present in your S3 bucket. Don't forget to tell Amazon ML that the first line of the file contains the column names.

In our current project, the held-out dataset also contains the true values for the weight of the students. This would not be the case for "real" data in a real-world project, where the real values are truly unknown. In our case, however, this will allow us to calculate the RMSE score of our predictions and assess their quality. The final step is to click on the Verify button and wait for a few minutes.

Amazon ML will run the model on the new datasource and generate predictions in the form of a CSV file. Contrary to the evaluation and model-building phase, we now have real predictions.
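The training loop described at the start of this section (sweep the learning rate, pass over the shuffled data, stop when the RMSE stops improving, keep the setting with the lowest RMSE) can be sketched as follows. This is a schematic reconstruction, not Amazon ML's code; trainOnePass and the row format are assumptions made purely for illustration.

```javascript
// Schematic reconstruction of the parameter sweep described above (not Amazon ML's code).
// `trainOnePass(model, rows, learningRate)` is an assumed helper that performs one
// shuffled SGD pass and returns an updated model exposing a `predict(row)` method.
function rmse(model, rows) {
  const sumSq = rows.reduce((sum, row) => sum + (model.predict(row) - row.weight) ** 2, 0);
  return Math.sqrt(sumSq / rows.length);
}

function selectBestModel(trainOnePass, initialModel, rows, maxPasses = 10) {
  let best = { error: Infinity, model: null };
  for (const learningRate of [0.01, 0.1, 1, 10, 100]) {
    let model = initialModel;
    let previousError = Infinity;
    for (let pass = 0; pass < maxPasses; pass++) {
      model = trainOnePass(model, rows, learningRate);
      const error = rmse(model, rows);
      if (previousError - error < 1e-4) break; // converged: negligible improvement
      previousError = error;
    }
    const error = rmse(model, rows);
    if (error < best.error) best = { error, model }; // lowest RMSE wins
  }
  return best.model;
}
```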
We are also no longer given a score associated with these predictions. After a few minutes, you will notice a new batch-prediction folder in your S3 bucket. This folder contains a manifest file and a results folder. The manifest file is a JSON file with the path to the initial datasource and the path to the results file. The results folder contains a gzipped CSV file: Uncompressed, the CSV file contains two columns, trueLabel, the initial target from the held-out set, and score, which corresponds to the predicted values. We can easily calculate the RMSE for those results directly in the spreadsheet through the following steps: Creating a new column that holds the square of the difference of the two columns. Summing all the rows. Taking the square root of the result. The following illustration shows how we create a third column C, as the squared difference between the trueLabel column A and the score (or predicted value) column B: As shown in the following screenshot, averaging column C and taking the square root gives an RMSE of 11.96, which is even significantly better than the RMSE we obtained during the evaluation phase (RMSE 14.4): The fact that the RMSE on the held-out set is better than the RMSE on the validation set means that our model did not overfit the training data, since it performed even better on new data than expected. Our model is robust. The left side of the following graph shows the True (Triangle) and Predicted (Circle) Weight values for all the samples in the held-out set. The right side shows the histogram of the residuals. Similar to the histogram of residuals we had observed on the validation set, we observe that the residuals are not centered on 0. Our model has a tendency to overestimate the weight of the students: In this tutorial, we have successfully performed the loading of the data on S3 and let Amazon ML infer the schema and transform the data. We also created a model and evaluated its performance. Finally, we made a prediction on the held -out dataset. To understand how to leverage Amazon's powerful platform for your predictive analytics needs,  check out this book Effective Amazon Machine Learning. Four interesting Amazon patents in 2018 that use machine learning, AR, and robotics Amazon Sagemaker makes machine learning on the cloud easy Amazon ML Solutions Lab to help customers “work backwards” and leverage machine learning    


Why use JavaScript for machine learning?

Sugandha Lahoti
01 Sep 2018
4 min read
Python has always been and remains the language of choice for machine learning, in part due to the maturity of the language, in part due to the maturity of the ecosystem, and in part due to the positive feedback loop of early ML efforts in Python. Recent developments in the JavaScript world, however, are making JavaScript more attractive to ML projects. I think we will see a major ML renaissance in JavaScript within a few years, especially as laptops and mobile devices become ever more powerful and JavaScript itself surges in popularity. This post is extracted from the book  Hands-on Machine Learning with JavaScript by Burak Kanber. The book is a  definitive guide to creating intelligent web applications with the best of machine learning and JavaScript. Advantages and challenges of JavaScript JavaScript, like any other tool, has its advantages and disadvantages. Much of the historical criticism of JavaScript has focused on a few common themes: strange behavior in type coercion, the prototypical object-oriented model, difficulty organizing large codebases, and managing deeply nested asynchronous function calls with what many developers call callback hell. Fortunately, most of these historic gripes have been resolved by the introduction of ES6, that is, ECMAScript 2015, a recent update to the JavaScript syntax. Con: Immature ecosystem for machine learning development Despite the recent language improvements, most developers would still advise against using JavaScript for ML for one reason: the ecosystem. The Python ecosystem for ML is so mature and rich that it's difficult to justify choosing any other ecosystem. But this logic is self-fulfilling and self-defeating; we need brave individuals to take the leap and work on real ML problems if we want JavaScript's ecosystem to mature. Fortunately, JavaScript has been the most popular programming language on GitHub for a few years running, and is growing in popularity by almost every metric. Pro #1: JavaScript is the most popular web development language with a mature npm ecosystem There are some advantages to using JavaScript for ML. Its popularity is one; while ML in JavaScript is not very popular at the moment, the language itself is. As demand for ML applications rises, and as hardware becomes faster and cheaper, it's only natural for ML to become more prevalent in the JavaScript world. There are tons of resources available for learning JavaScript in general, maintaining Node.js servers, and deploying JavaScript applications. The Node Package Manager (npm) ecosystem is also large and still growing, and while there aren't many very mature ML packages available, there are a number of well built, useful tools out there that will come to maturity soon. Pro #2: JavaScript is now a general purpose, cross-platform programming language Another advantage to using JavaScript is the universality of the language. The modern web browser is essentially a portable application platform which allows you to run your code, basically without modification, on nearly any device. Tools like electron (while considered by many to be bloated) allow developers to quickly develop and deploy downloadable desktop applications to any operating system. Node.js lets you run your code in a server environment. React Native brings your JavaScript code to the native mobile application environment, and may eventually allow you to develop desktop applications as well. 
JavaScript is no longer confined to just dynamic web interactions, it's now a general-purpose, cross-platform programming language. Pro #3: JavaScript makes Machine Learning accessible to web and front-end developers Finally, using JavaScript makes ML accessible to web and frontend developers, a group that historically has been left out of the ML discussion. Server-side applications are typically preferred for ML tools, since the servers are where the computing power is. That fact has historically made it difficult for web developers to get into the ML game, but as hardware improves, even complex ML models can be run on the client, whether it's the desktop or the mobile browser. If web developers, frontend developers, and JavaScript developers all start learning about ML today, that same community will be in a position to improve the ML tools available to us all tomorrow. If we take these technologies and democratize them, expose as many people as possible to the concepts behind ML, we will ultimately elevate the community and seed the next generation of ML researchers. Summary In this article, we've discussed the important moments of JavaScript's history as applied to ML. We’ve discussed some advantages to using JavaScript for machine learning, and also some of the challenges we’re facing, particularly in terms of the machine learning ecosystem. To begin exploring and processing the data itself, read our book  Hands-on Machine Learning with JavaScript. 5 JavaScript machine learning libraries you need to know V8 JavaScript Engine releases version 6.9! HTML5 and the rise of modern JavaScript browser APIs [Tutorial]


How to build a real-time data pipeline for web developers - Part 2 [Tutorial]

Sugandha Lahoti
30 Aug 2018
15 min read
Our previous post talked about two components to build a real-time data pipeline. To recap: Most data pipelines  contain these components: Data querying and event subscription Data joining or aggregation Transformation and normalization Storage and delivery In this post, we will talk about the last two components and introduce some tools and techniques that can achieve them. This post is extracted from the book Hands-on Machine Learning with JavaScript by Burak Kanber. The book is a  definitive guide to creating an intelligent web application with the best of machine learning and JavaScript. Transformation and normalization of a data pipeline As your data makes its way through a pipeline, it may need to be converted into a structure compatible with your algorithm's input layer. There are many possible transformations that can be performed on the data in the pipeline. For example, in order to protect sensitive user data before it reaches a token-based classifier, you might apply a cryptographic hashing function to the tokens so that they are no longer human readable. Types of Data transformations More typically, the types of transformations will be related to sanitization, normalization, or transposition. A sanitization operation might involve removing unnecessary whitespace or HTML tags, removing email addresses from a token stream, and removing unnecessary fields from the data structure. If your pipeline has subscribed to an event stream as the source of the data and the event stream attaches source server IP addresses to event data, it would be a good idea to remove these values from the data structure, both in order to save space and to minimize the surface area for potential data leaks. Similarly, if email addresses are not necessary for your classification algorithm, the pipeline should remove that data so that it interacts with the fewest possible servers and systems. If you've designed a spam filter, you may want to look into using only the domain portion of the email address instead of the fully qualified address. Alternately, the email addresses or domains may be hashed by the pipeline so that the classifier can still recognize them but a human cannot. Make sure to audit your data for other potential security and privacy issues as well. If your application collects the end user's IP address as part of its event stream, but the classifier does not need that data, remove it from the pipeline as early as possible. These considerations are becoming ever more important with the implementation of new European privacy laws, and every developer should be aware of privacy and compliance concerns. Data Normalization A common category of data transformation is normalization. When working with a range of numerical values for a given field or feature, it's often desirable to normalize the range such that it has a known minimum and maximum bound. One approach is to normalize all values of the same field to the range [0,1], using the maximum encountered value as the divisor (for example, the sequence 1, 2, 4 can be normalized to 0.25, 0.5, 1). Whether data needs to be normalized in this manner will depend entirely on the algorithm that consumes the data. Another approach to normalization is to convert values into percentiles. In this scheme, very large outlying values will not skew the algorithm too drastically. If most values lie between 0 and 100 but a few points include values such as 50,000, an algorithm may give outsized precedence to the large values. 
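A minimal sketch of the [0, 1] normalization just described, in plain JavaScript, also makes the outlier problem easy to see:

```javascript
// Normalize a numeric feature to [0, 1] using the maximum encountered value as the divisor.
function normalizeToUnitRange(values) {
  const max = Math.max(...values);
  return values.map(v => v / max);
}

console.log(normalizeToUnitRange([1, 2, 4]));
// [0.25, 0.5, 1]

console.log(normalizeToUnitRange([10, 50, 100, 50000]));
// [0.0002, 0.001, 0.002, 1] -- a single outlier squashes every other value toward zero
```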
If the data is normalized as a percentile, however, you are guaranteed to not have any values exceeding 100 and the outliers are brought into the same range as the rest of the data. Whether or not this is a good thing depends on the algorithm. Instagram Example The data pipeline is also a good place to calculate derived or second-order features. Imagine a random forest classifier that uses Instagram profile data to determine if the profile belongs to a human or a bot. The Instagram profile data will include fields such as the user's followers count, friends count, posts count, website, bio, and username. A random forest classifier will have difficulty using those fields in their original representations, however, by applying some simple data transformations, you can achieve accuracies of 90%. In the Instagram case, one type of helpful data transformation is calculating ratios. Followers count and friends count, as separate features or signals, may not be useful to the classifier since they are treated somewhat independently. But the friends-to-followers ratio can turn out to be a very strong signal that may expose bot users. An Instagram user with 1,000 friends doesn't raise any flags, nor would an Instagram user with 50 followers; treated independently, these features are not strong signals. However, an Instagram user with a friends-to-followers ratio of 20 (or 1,000/50) is almost certainly a bot designed to follow other users. Similarly, a ratio such as posts-versus-followers or posts-versus-friends may end up being a stronger signal than any of those features independently. Text content such as the Instagram user's profile bio, website, or username is made useful by deriving second-order features from them as well. A classifier may not be able to do anything with a website's URL, but perhaps a Boolean has_profile_website feature can be used as a signal instead. If, in your research, you notice that usernames of bots tend to have a lot of numbers in them, you can derive features from the username itself. One feature can calculate the ratio of letters to numbers in the username, another Boolean feature can represent whether the username has a number at the end or beginning, and a more advanced feature could determine if dictionary words were used in the username or not (therefore distinguishing between @themachinelearningwriter and something gibberish like @panatoe234). Derived features can be of any level of sophistication or simplicity. Another simple feature could be whether the Instagram profile contains a URL in the profile bio field (as opposed to the dedicated website field); this can be detected with a regex and the Boolean value used as the feature. A more advanced feature could automatically detect whether the language used in the user's content is the same as the language specified by the user's locale setting. If the user claims they're in France but always writes captions in Russian it may indeed be a Russian living in France, but when combined with other signals like a friends-to-followers ratio far from 1, this information may be indicative of a bot user. There are lower level transformations that may need to be applied to the data in the pipeline as well. If the source data is in an XML format but the classifier requires JSON formatting, the pipeline should take responsibility for the parsing and conversion of formats. Other mathematical transformations may also be applied. 
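The ratio and Boolean features discussed above are cheap to compute in a pipeline stage. The sketch below uses hypothetical profile field names (followersCount, friendsCount, and so on), not Instagram's actual API response format:

```javascript
// Sketch of second-order feature derivation for a profile record (field names are hypothetical).
function deriveProfileFeatures(profile) {
  const letters = (profile.username.match(/[a-z]/gi) || []).length;
  const digits = (profile.username.match(/[0-9]/g) || []).length;
  return {
    friendsToFollowers: profile.friendsCount / Math.max(profile.followersCount, 1),
    postsToFollowers: profile.postsCount / Math.max(profile.followersCount, 1),
    hasProfileWebsite: profile.website ? 1 : 0,
    hasUrlInBio: /https?:\/\//.test(profile.bio) ? 1 : 0,
    usernameDigitRatio: digits / Math.max(letters + digits, 1),
    usernameEndsWithNumber: /[0-9]$/.test(profile.username) ? 1 : 0,
  };
}

// 1,000 friends and 50 followers gives friendsToFollowers = 20, a far stronger bot
// signal than either raw count treated independently.
console.log(deriveProfileFeatures({
  username: 'panatoe234', friendsCount: 1000, followersCount: 50,
  postsCount: 3, website: '', bio: 'Click here http://example.com',
}));
```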
If the native format of the data is row-oriented but the classifier needs column-oriented data, the pipeline can perform a vector transposition operation as part of the processing. Similarly, the pipeline can use mathematical interpolation to fill in missing values. If your pipeline subscribes to events emitted by a suite of sensors in a laboratory setting and a single sensor goes offline for a couple of measurements, it may be reasonable to interpolate between the two known values in order to fill in the missing data. In other cases, missing values can be replaced with the population's mean or median value. Replacing missing values with a mean or median will often result in the classifier deprioritizing that feature for that data point, as opposed to breaking the classifier by giving it a null value. What to consider when transforming and normalizing In general, there are two things to consider in terms of transformation and normalization within a data pipeline. The first is the mechanical details of the source data and the target format: XML data must be transformed to JSON, rows must be converted to columns, images must be converted from JPEG to BMP formats, and so on. The mechanical details are not too tricky to work out, as you will already be aware of the source and target formats required by the system. The other consideration is the semantic or mathematical transformation of your data. This is an exercise in feature selection and feature engineering, and is not as straightforward as the mechanical transformation. Determining which second-order features to derive is both art and science. The art is coming up with new ideas for derived features, and the science is to rigorously test and experiment with your work. In my experience with Instagram bot detection, for instance, I found that the letters-to-numbers ratio in Instagram usernames was a very weak signal. I abandoned that idea after some experimentation in order to avoid adding unnecessary dimensionality to the problem. At this point, we have a hypothetical data pipeline that collects data, joins and aggregates it, processes it, and normalizes it. We're almost done, but the data still needs to be delivered to the algorithm itself. Once the algorithm is trained, we might also want to serialize the model and store it for later use. In the next section, we'll discuss a few considerations to make when transporting and storing training data or serialized models. Storing and delivering data in a data pipeline Once your data pipeline has applied all the necessary processing and transformations, it has one task left to do: deliver the data to your algorithm. Ideally, the algorithm will not need to know about the implementation details of the data pipeline. The algorithm should have a single location that it can interact with in order to get the fully processed data. This location could be a file on disk, a message queue, a service such as Amazon S3, a database, or an API endpoint. The approach you choose will depend on the resources available to you, the topology or architecture of your server system, and the format and size of the data. Models that are trained only periodically are typically the simplest case to handle. If you're developing an image recognition RNN that learns labels for a number of images and only needs to be retrained every few months, a good approach would be to store all the images as well as a manifest file (relating image names to labels) in a service such as Amazon S3 or a dedicated path on disk. 
The algorithm would first load and parse the manifest file and then load the images from the storage service as needed. Similarly, an Instagram bot detection algorithm may only need to be retrained every week or every month. The algorithm can read training data directly from a database table or a JSON or CSV file stored on S3 or a local disk. It is rare to have to do this, but in some exotic data pipeline implementations you could also provide the algorithm with a dedicated API endpoint built as a microservice; the algorithm would simply query the API endpoint first for a list of training point references, and then request each in turn from the API. Models which require online updates or near-real-time updates, on the other hand, are best served by a message queue. If a Bayesian classifier requires live updates, the algorithm can subscribe to a message queue and apply updates as they come in. Even when using a sophisticated multistage pipeline, it is possible to process new data and update a model in fractions of a second if you've designed all the components well. Storing and Delivering Data in a Spam Filter Returning to the spam filter example, we can design a highly performant data pipeline like so: first, an API endpoint receives feedback from a user. In order to keep the user interface responsive, this API endpoint is responsible only for placing the user's feedback into a message queue and can finish its task in under a millisecond. The data pipeline in turn subscribes to the message queue, and in another few milliseconds is made aware of a new message. The pipeline then applies a few simple transformations to the message, like tokenizing, stemming, and potentially even hashing the tokens. The next stage of the pipeline transforms the token stream into a hashmap of tokens and their counts (for example, from hey hey there to {hey: 2, there: 1}); this avoids the need for the classifier to update the same token's count more than once. This stage of processing will only require another couple of milliseconds at worst. Finally, the fully processed data is placed in a separate message queue which the classifier subscribes to. Once the classifier is made aware of the data it can immediately apply the updates to the model. If the classifier is backed by Redis, for instance, this final stage will also require only a few milliseconds. The entire process we have described, from the time the user's feedback reaches the API server to the time the model is updated, may only require 20 ms. Considering that communication over the internet (or any other means) is limited by the speed of light, the best-case scenario for a TCP packet making a round-trip between New York and San Francisco is 40 ms; in practice, the average cross-country latency for a good internet connection is about 80 ms. Our data pipeline and model is therefore capable of updating itself based on user feedback a full 20 ms before the user will even receive their HTTP response. Not every application requires real-time processing. Managing separate servers for an API, a data pipeline, message queues, a Redis store, and hosting the classifier might be overkill both in terms of effort and budget. You'll have to determine what's best for your use case. Storage and Delivery of the model The last thing to consider is not related to the data pipeline but rather the storage and delivery of the model itself, in the case of a hybrid approach where a model is trained on the server but evaluated on the client. 
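Before coming back to model storage and delivery, here is a rough sketch of the token-counting stage from the spam-filter pipeline described above, the step that turns hey hey there into {hey: 2, there: 1}. It is a toy illustration in Python, not the book's code: the stemmer is a crude stand-in for a real one, and hashing is optional:
from collections import Counter
import hashlib
import json

def crude_stem(token):
    # Stand-in for a real stemmer (for example, a Porter stemmer); it only strips
    # a few common suffixes, which is enough to illustrate the pipeline stage.
    for suffix in ("ing", "ed", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[:-len(suffix)]
    return token

def tokens_to_counts(text, hash_tokens=False):
    tokens = [crude_stem(t) for t in text.lower().split()]
    if hash_tokens:
        tokens = [hashlib.md5(t.encode()).hexdigest()[:8] for t in tokens]
    return dict(Counter(tokens))

# The serialized result is what gets placed on the queue the classifier subscribes to.
print(json.dumps(tokens_to_counts("hey hey there")))   # {"hey": 2, "there": 1}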
The first question to ask yourself is whether the model is considered public or private. Private models should not be stored on a public Amazon S3 bucket, for instance; instead, the S3 bucket should have access control rules in place and your application will need to procure a signed download link with an expiration time (the S3 API assists with this). The next consideration is how large the model is and how often it will be downloaded by clients. If a public model is downloaded frequently but updated infrequently, it might be best to use a CDN in order to take advantage of edge caching. If your model is stored on Amazon S3, for example, then the Amazon CloudFront CDN would be a good choice. Of course, you can always build your own storage and delivery solution. In this post, I have assumed a cloud architecture, however if you have a single dedicated or collocated server then you may simply want to store the serialized model on disk and serve it either through your web server software or through your application's API. When dealing with large models, make sure to consider what will happen if many users attempt to download the model simultaneously. You may inadvertently saturate your server's network connection if too many people request the file at once, you might overrun any bandwidth limits set by your server's ISP, or you might end up with your server's CPU stuck in I/O wait while it moves data around. No one-size-fits-all solution for data pipelining As mentioned previously, there's no one-size-fits-all solution for data pipelining. If you're a hobbyist developing applications for fun or just a few users, you have lots of options for data storage and delivery. If you're working in a professional capacity on a large enterprise project, however, you will have to consider all aspects of the data pipeline and how they will impact your application's performance. I will offer one final piece of advice to the hobbyists reading this section. While it's true that you don't need a sophisticated, real-time data pipeline for hobby projects, you should build one anyway. Being able to design and build real-time data pipelines is a highly marketable and valuable skill that not many people possess, and if you're willing to put in the practice to learn ML algorithms then you should also practice building performant data pipelines. I'm not saying that you should build a big, fancy data pipeline for every single hobby project—just that you should do it a few times, using several different approaches, until you're comfortable not just with the concepts but also the implementation. Practice makes perfect, and practice means getting your hands dirty. In this two part series, we learned about data pipelines and various mechanisms that manage the collection, combination, transformation, and delivery of data from one system to the next. Next, to exactly choose the right ML algorithm for a given problem, read our book  Hands-on Machine Learning with JavaScript. Create machine learning pipelines using unsupervised AutoML [Tutorial] Top AutoML libraries for building your ML pipelines
How to build a real-time data pipeline for web developers - Part 1 [Tutorial]

Sugandha Lahoti
29 Aug 2018
12 min read
There are many differences between the idealized usage of ML algorithms and real-world usage. This post gives advice related to using ML in the real world, in real applications, and in production environments. Specifically, we will talk about how to build a real-time data pipeline. The article aims to answer the following questions How do you collect, store, and process gigabytes or terabytes of training data? How and where do you store and distribute serialized models to clients? How do you collect new training examples from millions of users? This post is extracted from the book Hands-on Machine Learning with JavaScript by Burak Kanber. The book is a  definitive guide to creating an intelligent web application with the best of machine learning and JavaScript. What are Data pipelines? When developing a production ML system, it's not likely that you will have the training data handed to you in a ready-to-process format. Production ML systems are typically part of larger application systems, and the data that you use will probably originate from several different sources. The training set for an ML algorithm may be a subset of your larger database, combined with images hosted on a Content Delivery Network (CDN) and event data from an Elasticsearch server. The process of ushering data through various stages of a life cycle is called data pipelining. Data pipelining may include data selectors that run SQL or Elasticsearch queries for objects, event subscriptions which allow data to flow in from event-or log-based data, aggregations, joins, combining data with data from third-party APIs, sanitization, normalization, and storage. In an ideal implementation, the data pipeline acts as an abstraction layer between the larger application environment and the ML process. The ML algorithm should be able to read the output of the data pipeline without any knowledge of the original source of the data, similar to our examples. As there are many possible data sources and infinite ways to architect an application, there is no one-size-fits-all data pipeline. However, most data pipelines will contain these components, which we will discuss in the following sections: Data querying and event subscription Data joining or aggregation Transformation and normalization Storage and delivery This article is a two-part post. In the first part, we will talk about Data Querying and event subscription and Data joining. Data querying Imagine an application such as Disqus, which is an embeddable comment form that website owners can use to add comment functionality to blog posts or other pages. The primary functionality of Disqus is to allow users to like or leave comments on posts, however, as an additional feature and revenue stream, Disqus can make content recommendations and display them alongside sponsored content. The content recommendation system is an example of an ML system that is only one feature of a larger application. A content recommendation system in an application such as Disqus does not necessarily need to interact with the comment data, but might use the user's likes history to generate recommendations similar to the current page. Such a system would also need to analyze the text content of the liked pages and compare that to the text content of all pages in the network in order to make recommendations. Disqus does not need the post's content in order to provide comment functionality, but does need to store metadata about the page (like its URL and title) in its database. 
The post content may therefore not reside in the application's main database, though the likes and page metadata would likely be stored there. A data pipeline built around Disqus's recommendation system needs first to query the main database for pages the user has liked—or pages that were liked by users who liked the current page—and return their metadata. In order to find similar content, however, the system will need to use the text content of each liked post. This data might be stored in a separate system, perhaps a secondary database such as MongoDB or Elasticsearch, or in Amazon S3 or some other data warehouse. The pipeline will need to retrieve the text content based on the metadata returned by the main database, and associate the content with the metadata. This is an example of multiple data selectors or data sources in the early stages of a data pipeline. One data source is the primary application data, which stores post and likes metadata. The other data source is a secondary server which stores the post's text content. The next step in this pipeline might involve finding a number of candidate posts similar to the ones the user has liked, perhaps through a request to Elasticsearch or some other service that can find similar content. Similar content is not necessarily the correct content to serve, however, so these candidate articles will ultimately be ranked by an (hypothetical) ANN in order to determine the best content to display. In this example, the input to the data pipeline is the current page and the output from the data pipeline is a list of, say, 200 similar pages that the ANN will then rank. If all the necessary data resides in the primary database, the entire pipeline can be achieved with an SQL statement and some JOINs. Even in this case, care should be taken to develop a degree of abstraction between the ML algorithm and the data pipeline, as you may decide to update the application's architecture in the future. In other cases, however, the data will reside in separate locations and a more considered pipeline should be developed. There are many ways to build this data pipeline. You could develop a JavaScript module that performs all the pipeline tasks, and in some cases, you could even write a bash script using standard Unix tools to accomplish the task. On the other end of the complexity spectrum, there are purpose-built tools for data pipelining such as Apache Kafka and AWS Pipeline. These systems are designed modularly and allow you to define a specific data source, query, transformation, and aggregation modules as well as the workflows that connect them. In AWS Pipeline, for instance, you define data nodes that understand how to interact with the various data sources in your application. The earliest stage of a pipeline is typically some sort of data query operation. Training examples must be extracted from a larger database, keeping in mind that not every record in a database is necessarily a training example. In the case of a spam filter, for instance, you should only select messages that have been marked as spam or not spam by a user. Messages that were automatically marked as spam by a spam filter should probably not be used for training, as that might cause a positive feedback loop that ultimately causes an unacceptable false positive rate. Similarly, you may want to prevent users that have been blocked or banned by your system from influencing your model training. 
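A data query embodying these selection rules might look roughly like the following sketch. The book's examples are in JavaScript, but SQL is SQL, so Python with sqlite3 is used here for brevity; the table and column names (messages, label_source, banned, and so on) are assumptions for illustration only:
import sqlite3

# Only user-labelled messages, and only from users in good standing.
TRAINING_QUERY = """
    SELECT m.body, m.label
    FROM messages m
    JOIN users u ON u.id = m.user_id
    WHERE m.label IN ('spam', 'not_spam')
      AND m.label_source = 'user'   -- skip labels the filter applied automatically
      AND u.banned = 0              -- keep blocked or banned users out of training
"""

def fetch_training_examples(db_path="app.db"):
    with sqlite3.connect(db_path) as conn:
        return conn.execute(TRAINING_QUERY).fetchall()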
A bad actor could intentionally mislead an ML model by taking inappropriate actions on their own data, so you should disqualify these data points as training examples. Alternatively, if your application is such that recent data points should take precedence over older training points, your data query operation might set a time-based limit on the data to use for training, or select a fixed limit ordered reverse chronologically. No matter the situation, make sure you carefully consider your data queries as they are an essential first step in your data pipeline. Not all data needs to come from database queries, however. Many applications use a pub/sub or event subscription architecture to capture streaming data. This data could be activity logs aggregated from a number of servers, or live transaction data from a number of sources. In these cases, an event subscriber will be an early part of your data pipeline. Note that event subscription and data querying are not mutually exclusive operations. Events that come in through a pub/sub system can still be filtered based on various criteria; this is still a form of data querying. One potential issue with an event subscription model arises when it's combined with a batch-training scheme. If you require 5,000 data points but receive only 100 per second, your pipeline will need to maintain a buffer of data points until the target size is reached. There are various message-queuing systems that can assist with this, such as RabbitMQ or Redis. A pipeline requiring this type of functionality might hold messages in a queue until the target of 5,000 messages is achieved, and only then release the messages for batch processing through the rest of the pipeline. In the case that data is collected from multiple sources, it most likely will need to be joined or aggregated in some manner. Let's now take a look at a situation where data needs to be joined to data from an external API. Data joining and aggregation Let's return to our example of the Disqus content recommendation system. Imagine that the data pipeline is able to query likes and post metadata directly from the primary database, but that no system in the applications stores the post's text content. Instead, a microservice was developed in the form of an API that accepts a post ID or URL and returns the page's sanitized text content. In this case, the data pipeline will need to interact with the microservice API in order to get the text content for each post. This approach is perfectly valid, though if the frequency of post content requests is high, some caching or storage should probably be implemented. The data pipeline will need to employ an approach similar to the buffering of messages in the event subscription model. The pipeline can use a message queue to queue posts that still require content, and make requests to the content microservice for each post in the queue until the queue is depleted. As each post's content is retrieved it is added to the post metadata and stored in a separate queue for completed requests. Only when the source queue is depleted and the sink queue is full should the pipeline move on to the next step. Data joining does not necessarily need to involve a microservice API. If the pipeline collects data from two separate sources that need to be combined, a similar approach can be employed. 
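The source-queue/sink-queue pattern just described can be sketched as follows (again in Python for brevity); the microservice call is faked with a local function, and the field names are assumptions for illustration:
from collections import deque

def fetch_post_content(post_id):
    # Stand-in for the content microservice; in reality this would be an HTTP
    # request to a (hypothetical) endpoint returning the sanitized page text.
    return "sanitized text for post %s" % post_id

def join_content(posts_metadata):
    source = deque(posts_metadata)   # posts that still need their text content
    sink = deque()                   # completed, joined records
    while source:                    # only move on once the source queue is depleted
        post = source.popleft()
        post["content"] = fetch_post_content(post["page_id"])
        sink.append(post)
    return list(sink)

print(join_content([{"page_id": 1, "title": "Post A"},
                    {"page_id": 2, "title": "Post B"}]))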
The pipeline is the only component that needs to understand the relationship between the two data sources and formats, leaving both the data sources and the ML algorithm to operate independently of those details. The queue approach also works well when a data aggregation is required. An example of this situation is a pipeline in which the input is streaming input data and the output is token counts or value aggregations. Using a message queue is desirable in these situations as most message queues ensure that a message can be consumed only once, therefore preventing any duplication by the aggregator. This is especially valuable when the event stream is very high frequency, such that tokenizing each event as it comes in would lead to backups or server overload. Because message queues ensure that each message is consumed only once, high-frequency event data can stream directly into a queue where messages are consumed by multiple workers in parallel. Each worker might be responsible for tokenizing the event data and then pushing the token stream to a different message queue. The message queue software ensures that no two workers process the same event, and each worker can operate as an independent unit that is only concerned with tokenization. As the tokenizers push their results onto a new message queue, another worker can consume those messages and aggregate token counts, delivering its own results to the next step in the pipeline every second or minute or 1,000 events, whatever is appropriate for the application. The output of this style of pipeline might be fed into a continually updating Bayesian model, for example. One benefit of a data pipeline designed in this manner is performance. If you were to attempt to subscribe to high-frequency event data, tokenize each message, aggregate token counts, and update a model all in one system, you might be forced to use a very powerful (and expensive) single server. The server would simultaneously need a high-performance CPU, lots of RAM, and a high-throughput network connection. By breaking up the pipeline into stages, however, you can optimize each stage of the pipeline for its specific task and load condition. The message queue that receives the source event stream needs only to receive the event stream but does not need to process it. The tokenizer workers do not necessarily need to be high-performance servers, as they can be run in parallel. The aggregating queue and worker will process a large volume of data but will not need to retain data for longer than a few seconds and therefore may not need much RAM. The final model, which is a compressed version of the source data, can be stored on a more modest machine. Many components of the data pipeline can be built of commodity hardware simply because a data pipeline encourages modular design. In many cases, you will need to transform your data from format to format throughout the pipeline. That could mean converting from native data structures to JSON, transposing or interpolating values, or hashing values. We talked about two data pipelines components. In the next post, we will discuss several types of data transformations that may occur in the data pipeline. We will also discuss a few considerations to make when transporting and storing training data or serialized models. Create machine learning pipelines using unsupervised AutoML [Tutorial] Top AutoML libraries for building your ML pipelines
Intelligent mobile projects with TensorFlow: Build a basic Raspberry Pi robot that listens, moves, sees, and speaks [Tutorial]

Bhagyashree R
27 Aug 2018
14 min read
According to Wikipedia, "The Raspberry Pi is a series of small single-board computers developed in the United Kingdom by the Raspberry Pi Foundation to promote the teaching of basic computer science in schools and in developing countries." The official site of Raspberry Pi describes it as "a small and affordable computer that you can use to learn programming." If you have never heard of or used Raspberry Pi before, just go its website and chances are you'll quickly fall in love with the cool little thing. Little yet powerful—in fact, developers of TensorFlow made TensorFlow available on Raspberry Pi from early versions around mid-2016, so we can run complicated TensorFlow models on the tiny computer that you can buy for about $35. In this article we will see how to set up TensorFlow on Raspberry Pi and use the TensorFlow image recognition and audio recognition models, along with text to speech and robot movement APIs, to build a Raspberry Pi robot that can move, see, listen, and speak. This tutorial is an excerpt from a book written by Jeff Tang titled Intelligent Mobile Projects with TensorFlow. Setting up TensorFlow on Raspberry Pi To use TensorFlow in Python, we can install the TensorFlow 1.6 nightly build for Pi at the TensorFlow Jenkins continuous integrate site (http://ci.tensorflow.org/view/Nightly/job/nightly-pi/223/artifact/output-artifacts): sudo pip install http://ci.tensorflow.org/view/Nightly/job/nightly-pi/lastSuccessfulBuild/artifact/output-artifacts/tensorflow-1.6.0-cp27-none-any.whl This method is quite common. A more complicated method is to use the makefile, required when you need to build and use the TensorFlow library. The Raspberry Pi section of the official TensorFlow makefile documentation (https://github.com/tensorflow/tensorflow/tree/master/tensorflow/contrib/makefile) has detailed steps to build the TensorFlow library, but it may not work with every release of TensorFlow. The steps there work perfectly with an earlier version of TensorFlow (0.10), but would cause many "undefined reference to google::protobuf" errors with the TensorFlow 1.6. The following steps have been tested with the TensorFlow 1.6 release, downloadable at https://github.com/tensorflow/tensorflow/releases/tag/v1.6.0; you can certainly try a newer version in the TensorFlow releases page, or clone the latest TensorFlow source by git clone https://github.com/tensorflow/tensorflow, and fix any possible hiccups. After cd to your TensorFlow source root, we run the following commands: tensorflow/contrib/makefile/download_dependencies.sh sudo apt-get install -y autoconf automake libtool gcc-4.8 g++-4.8 cd tensorflow/contrib/makefile/downloads/protobuf/ ./autogen.sh ./configure make CXX=g++-4.8 sudo make install sudo ldconfig # refresh shared library cache cd ../../../../.. export HOST_NSYNC_LIB=`tensorflow/contrib/makefile/compile_nsync.sh` export TARGET_NSYNC_LIB="$HOST_NSYNC_LIB" Make sure you run make CXX=g++-4.8, instead of just make, as documented in the official TensorFlow Makefile documentation, because Protobuf must be compiled with the same gcc version as that used for building the following TensorFlow library, in order to fix those "undefined reference to google::protobuf" errors. 
Now try to build the TensorFlow library using the following command: make -f tensorflow/contrib/makefile/Makefile HOST_OS=PI TARGET=PI \ OPTFLAGS="-Os -mfpu=neon-vfpv4 -funsafe-math-optimizations -ftree-vectorize" CXX=g++-4.8 After a few hours of building, you'll likely get an error such as "virtual memory exhausted: Cannot allocate memory" or the Pi board will just freeze due to running out of memory. To fix this, we need to set up a swap, because without the swap, when an application runs out of the memory, the application will get killed due to a kernel panic. There are two ways to set up a swap: swap file and swap partition. Raspbian uses a default swap file of 100 MB on the SD card, as shown here using the free command: pi@raspberrypi:~/tensorflow-1.6.0 $ free -h total used free shared buff/cache available Mem: 927M 45M 843M 660K 38M 838M Swap: 99M 74M 25M To improve the swap file size to 1 GB, modify the /etc/dphys-swapfile file via sudo vi /etc/dphys-swapfile, changing CONF_SWAPSIZE=100 to CONF_SWAPSIZE=1024, then restart the swap file service: sudo /etc/init.d/dphys-swapfile stop sudo /etc/init.d/dphys-swapfile start After this, free -h will show the Swap total to be 1.0 GB. A swap partition is created on a separate USB disk and is preferred because a swap partition can't get fragmented but a swap file on the SD card can get fragmented easily, causing slower access. To set up a swap partition, plug a USB stick with no data you need on it to the Pi board, then run sudo blkid, and you'll see something like this: /dev/sda1: LABEL="EFI" UUID="67E3-17ED" TYPE="vfat" PARTLABEL="EFI System Partition" PARTUUID="622fddad-da3c-4a09-b6b3-11233a2ca1f6" /dev/sda2: UUID="E67F-6EAB" TYPE="vfat" PARTLABEL="NO NAME" PARTUUID="a045107a-9e7f-47c7-9a4b-7400d8d40f8c" /dev/sda2 is the partition we'll use as the swap partition. Now unmount and format it to be a swap partition: sudo umount /dev/sda2 sudo mkswap /dev/sda2 mkswap: /dev/sda2: warning: wiping old swap signature. Setting up swapspace version 1, size = 29.5 GiB (31671701504 bytes) no label, UUID=23443cde-9483-4ed7-b151-0e6899eba9de You'll see a UUID output in the mkswap command; run sudo vi /etc/fstab, add a line as follows to the fstab file with the UUID value: UUID=<UUID value> none swap sw,pri=5 0 0 Save and exit the fstab file and then run sudo swapon -a. Now if you run free -h again, you'll see the Swap total to be close to the USB storage size. We definitely don't need all that size for swap—in fact, the recommended maximum swap size for the Raspberry Pi 3 board with 1 GB memory is 2 GB, but we'll leave it as is because we just want to successfully build the TensorFlow library. With either of the swap setting changes, we can rerun the make command: make -f tensorflow/contrib/makefile/Makefile HOST_OS=PI TARGET=PI \ OPTFLAGS="-Os -mfpu=neon-vfpv4 -funsafe-math-optimizations -ftree-vectorize" CXX=g++-4.8 After this completes, the TensorFlow library will be generated as tensorflow/contrib/makefile/gen/lib/libtensorflow-core.a. Now we can build the image classification example using the library. Image recognition and text to speech There are two TensorFlow Raspberry Pi example apps (https://github.com/tensorflow/tensorflow/tree/master/tensorflow/contrib/pi_examples) located in tensorflow/contrib/pi_examples: label_image and camera. We'll modify the camera example app to integrate text to speech so the app can speak out its recognized images when moving around. 
Before we build and test the two apps, we need to install some libraries and download the pre-built TensorFlow Inception model file: sudo apt-get install -y libjpeg-dev sudo apt-get install libv4l-dev curl https://storage.googleapis.com/download.tensorflow.org/models/inception_dec_2015_stripped.zip -o /tmp/inception_dec_2015_stripped.zip cd ~/tensorflow-1.6.0 unzip /tmp/inception_dec_2015_stripped.zip -d tensorflow/contrib/pi_examples/label_image/data/ To build the label_image and camera apps, run: make -f tensorflow/contrib/pi_examples/label_image/Makefile make -f tensorflow/contrib/pi_examples/camera/Makefile You may encounter the following error when building the apps: ./tensorflow/core/platform/default/mutex.h:25:22: fatal error: nsync_cv.h: No such file or directory #include "nsync_cv.h" ^ compilation terminated. To fix this, run sudo cp tensorflow/contrib/makefile/downloads/nsync/public/nsync*.h /usr/include. Then edit the tensorflow/contrib/pi_examples/label_image/Makefile or  tensorflow/contrib/pi_examples/camera/Makefile file, add the following library, and include paths before running the make command again: -L$(DOWNLOADSDIR)/nsync/builds/default.linux.c++11 \ -lnsync \ To test run the two apps, run the apps directly: tensorflow/contrib/pi_examples/label_image/gen/bin/label_image tensorflow/contrib/pi_examples/camera/gen/bin/camera Take a look at the C++ source code,  tensorflow/contrib/pi_examples/label_image/label_image.cc and tensorflow/contrib/pi_examples/camera/camera.cc, and you'll see they use the similar C++ code as in our iOS apps in the previous chapters to load the model graph file, prepare input tensor, run the model, and get the output tensor. By default, the camera example also uses the prebuilt Inception model unzipped in the label_image/data folder. But for your own specific image classification task, you can provide your own model retrained via transfer learning using the --graph parameter when running the two example apps. In general, voice is a Raspberry Pi robot's main UI to interact with us. Ideally, we should run a TensorFlow-powered natural-sounding Text-to-Speech (TTS) model such as WaveNet (https://deepmind.com/blog/wavenet-generative-model-raw-audio) or Tacotron (https://github.com/keithito/tacotron), but it'd be beyond the scope of this article to run and deploy such a model. It turns out that we can use a much simpler TTS library called Flite by CMU (http://www.festvox.org/flite), which offers pretty decent TTS, and it takes just one simple command to install it: sudo apt-get install flite. If you want to install the latest version of Flite to hopefully get a better TTS quality, just download the latest Flite source from the link and build it. To test Flite with our USB speaker, run flite with the -t parameter followed by a double quoted text string such as  flite -t "i recommend the ATM machine". If you don't like the default voice, you can find other supported voices by running flite -lv, which should return Voices available: kal awb_time kal16 awb rms slt. Then you can specify a voice used for TTS: flite -voice rms -t "i recommend the ATM machine". To let the camera app speak out the recognized objects, which should be the desired behavior when the Raspberry Pi robot moves around, you can use this simple pipe command: tensorflow/contrib/pi_examples/camera/gen/bin/camera | xargs -n 1 flite -t You'll likely hear too much voice. 
To fine tune the TTS result of image classification, you can also modify the camera.cc file and add the following code to the PrintTopLabels function before rebuilding the example using make -f tensorflow/contrib/pi_examples/camera/Makefile: std::string cmd = "flite -voice rms -t \""; cmd.append(labels[label_index]); cmd.append("\""); system(cmd.c_str()); Now that we have completed the image classification and speech synthesis tasks, without using any Cloud APIs, let's see how we can do audio recognition on Raspberry Pi. Audio recognition and robot movement To use the pre-trained audio recognition model in the TensorFlow tutorial (https://www.tensorflow.org/tutorials/audio_recognition), we'll reuse a listen.py Python script from https://gist.github.com/aallan, and add the GoPiGo API calls to control the robot movement after it recognizes four basic audio commands: "left," "right," "go," and "stop." The other six commands supported by the pre-trained model—"yes," "no," "up," "down," "on," and "off"—don't apply well in our example. To run the script, first download the pre-trained audio recognition model from http://download.tensorflow.org/models/speech_commands_v0.01.zip and unzip it to /tmp for example, to the Pi board's /tmp directory, then run: python listen.py --graph /tmp/conv_actions_frozen.pb --labels /tmp/conv_actions_labels.txt -I plughw:1,0 Or you can run: python listen.py --graph /tmp/speech_commands_graph.pb --labels /tmp/conv_actions_labels.txt -I plughw:1,0 Note that plughw value 1,0 should match the card number and device number of your USB microphone, which can be found using the arecord -l command we showed before. The listen.py script also supports many other parameters. For example, we can use --detection_threshold 0.5 instead of the default detection threshold 0.8. Let's now take a quick look at how listen.py works before we add the GoPiGo API calls to make the robot move. listen.py uses Python's subprocess module and its Popen class to spawn a new process of running the arecord command with appropriate parameters. The Popen class has an stdout attribute that specifies the arecord executed command's standard output file handle, which can be used to read the recorded audio bytes. The Python code to load the trained model graph is as follows: with tf.gfile.FastGFile(filename, 'rb') as f: graph_def = tf.GraphDef() graph_def.ParseFromString(f.read()) tf.import_graph_def(graph_def, name='') A TensorFlow session is created using tf.Session() and after the graph is loaded and session created, the recorded audio buffer gets sent, along with the sample rate, as the input data to the TensorFlow session's run method, which returns the prediction of the recognition: run(softmax_tensor, { self.input_samples_name_: input_data, self.input_rate_name_: self.sample_rate_ }) Here, softmax_tensor is defined as the TensorFlow graph's get_tensor_by_name(self.output_name_), and output_name_,  input_samples_name_, and input_rate_name_ are defined as  labels_softmax, decoded_sample_data:0, decoded_sample_data:1, respectively. On Raspberry Pi, you can choose to run the TensorFlow models on Pi using the TensorFlow Python API directly, or C++ API (as in the label_image and camera examples), although normally you'd still train the models on a more powerful computer. For the complete TensorFlow Python API documentation, see https://www.tensorflow.org/api_docs/python. 
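As a rough illustration of that subprocess/Popen interaction (a sketch, not the actual listen.py source), the following reads raw audio from the microphone one second at a time; the device name and format flags are assumptions that should be matched to your own USB microphone (check arecord -l):
import subprocess

rec_cmd = ["arecord", "-D", "plughw:1,0", "-f", "S16_LE",
           "-r", "16000", "-c", "1", "-t", "raw"]
proc = subprocess.Popen(rec_cmd, stdout=subprocess.PIPE)

chunk_size = 16000 * 2   # one second of 16-bit mono samples at 16 kHz
try:
    while True:
        audio_bytes = proc.stdout.read(chunk_size)
        if not audio_bytes:
            break
        # listen.py buffers bytes like these and feeds them, together with the
        # sample rate, into the TensorFlow session's run() call shown above.
        print("read", len(audio_bytes), "bytes of audio")
finally:
    proc.terminate()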
To use the GoPiGo Python API to make the robot move based on your voice command, first add the following two lines to listen.py: import easygopigo3 as gpg gpg3_obj = gpg.EasyGoPiGo3() Then add the following code to the end of the def add_data method: if current_top_score > self.detection_threshold_ and time_since_last_top > self.suppression_ms_: self.previous_top_label_ = current_top_label self.previous_top_label_time_ = current_time_ms is_new_command = True logger.info(current_top_label) if current_top_label=="go": gpg3_obj.drive_cm(10, False) elif current_top_label=="left": gpg3_obj.turn_degrees(-30, False) elif current_top_label=="right": gpg3_obj.turn_degrees(30, False) elif current_top_label=="stop": gpg3_obj.stop() Now put your Raspberry Pi robot on the ground, connect to it with ssh from your computer, and run the following script: python listen.py --graph /tmp/conv_actions_frozen.pb --labels /tmp/conv_actions_labels.txt -I plughw:1,0 --detection_threshold 0.5 You'll see output like this: INFO:audio:started recording INFO:audio:_silence_ INFO:audio:_silence_ Then you can say left, right, stop, go, and stop to see the commands get recognized and the robot moves accordingly: INFO:audio:left INFO:audio:_silence_ INFO:audio:_silence_ INFO:audio:right INFO:audio:_silence_ INFO:audio:stop INFO:audio:_silence_ INFO:audio:go INFO:audio:stop You can run the camera app in a separate Terminal, so while the robot moves around based on your voice commands, it'll recognize new images it sees and speak out the results. That's all it takes to build a basic Raspberry Pi robot that listens, moves, sees, and speaks—what the Google I/O 2016 demo does but without using any Cloud APIs. It's far from a fancy robot that can understand natural human speech, engage in interesting conversations, or perform useful and non-trivial tasks. But powered with pre-trained, retrained, or other powerful TensorFlow models, and using all kinds of sensors, you can certainly add more and more intelligence and physical power to the Pi robot we have built. Google TensorFlow is used to train all the models deployed and running on mobile devices. This book covers 10 projects on the implementation of all major AI areas on iOS, Android, and Raspberry Pi: computer vision, speech and language processing, and machine learning, including traditional, reinforcement, and deep reinforcement. If you liked this tutorial and would like to implement projects for major AI areas on iOS, Android, and Raspberry Pi, check out the book Intelligent Mobile Projects with TensorFlow. TensorFlow 2.0 is coming. Here’s what we can expect. Build and train an RNN chatbot using TensorFlow [Tutorial] Use TensorFlow and NLP to detect duplicate Quora questions [Tutorial]
Unit testing Angular components and classes [Tutorial]

Natasha Mathur
27 Aug 2018
12 min read
Testing (and more specifically, unit testing) is meant to be carried out by the developer as the project is being developed. In this article, we will see how to implement testing tools to perform proper unit testing for your application classes and components. This tutorial is an excerpt taken from the book Learning Angular (Second Edition) written by Christoffer Noring and Pablo Deeleman. When venturing into unit testing in Angular, it's important to know which major parts it consists of. In Angular, these are:
Jasmine, the testing framework
Angular testing utilities
Karma, a test runner for running unit tests, among other things
Protractor, Angular's framework for E2E testing
Configuration and setting up of Angular CLI
In terms of configuration, when using the Angular CLI, you don't have to do anything to make it work. You can run your first test as soon as you scaffold a project, and it will work. The Angular CLI uses Karma as the test runner. What we need to know about Karma is that it uses a karma.conf.js configuration file, in which a lot of things are specified, such as:
The various plugins that enhance your test runner.
Where to find the tests to run. There is usually a files property in this file specifying where to find the application and the tests; for the Angular CLI, however, this specification is found in another file called src/tsconfig.spec.json.
The setup of your selected coverage tool, a tool that measures to what degree your tests cover the production code.
Reporters, which report every executed test to a console window, to a browser, or through some other means.
The browsers to run your tests in: for example, Chrome or PhantomJS.
Using the Angular CLI, you most likely won't need to change or edit this file yourself. It is good to know that it exists and what it does for you.
Angular testing utilities
The Angular testing utilities help to create a testing environment that makes writing tests for your various constructs really easy. They consist of the TestBed class and various helper functions, found under the @angular/core/testing namespace. Let's have a look at what these are and how they can help us to test various constructs. We will briefly introduce the most commonly used concepts so that you are familiar with them as we present them more deeply further on:
The TestBed class is the most important concept; it creates its own testing module. In reality, when you test a construct, you detach it from the module it resides in and reattach it to the testing module created by the TestBed. The TestBed class has a configureTestingModule() helper method that we use to set up the test module as needed. The TestBed can also instantiate components.
ComponentFixture is a class wrapping the component instance. This means that it has some functionality on it and it has a member that is the component instance itself.
The DebugElement, much like the ComponentFixture, acts as a wrapper. It, however, wraps the DOM element and not the component instance. It's a bit more than that though, as it has an injector on it that allows us to access the services that have been injected into a component.
This was a brief overview of our testing environment and the frameworks and libraries used. Now let's discuss component testing.
Introduction to component testing
A usual method of operation for doing anything Angular is to use the Angular CLI. Working with tests is no different.
The Angular CLI lets us create tests, debug them, and run them; it also gives us an understanding of how well our tests cover the code and its many scenarios. Component testing with dependencies We have learned a lot already, but let's face it, no component that we build will be as simple as the one we wrote in the preceding section. There will almost certainly be at least one dependency, looking like this: @Component({}) export class ExampleComponent { constructor(dependency:Dependency) {} } We have different ways of dealing with testing such a situation. One thing is clear though: if we are testing the component, then we should not test the service as well. This means that when we set up such a test, the dependency should not be the real thing. There are different ways of dealing with that when it comes to unit testing; no solution is strictly better than the other: Using a stub means that we tell the dependency injector to inject a stub that we provide, instead of the real thing. Injecting the real thing, but attaching a spy, to the method that we call in our component. Regardless of the approach, we ensure that the test is not performing a side effect such as talking to a filesystem or attempting to communicate via HTTP; we are, using this approach, isolated. Using a stub to replace the dependency Using a stub means that we completely replace what was there before. It is as simple to do as instructing the TestBed in the following way: TestBed.configureTestingModule({ declarations: [ExampleComponent] providers: [{ provide: DependencyService, useClass: DependencyServiceStub }] }); We define a providers array like we do with the NgModule, and we give it a list item that points out the definition we intend to replace and we give it the replacement instead; that is our stub. Let's now build our DependencyStub to look like this: class DependencyServiceStub { getData() { return 'stub'; } } Just like with an @NgModule, we are able to override the definition of our dependency with our own stub. Imagine our component looks like the following: import { Component } from '@angular/core'; import { DependencyService } from "./dependency.service"; @Component({ selector: 'example', template: ` <div>{{ title }}</div> ` }) export class ExampleComponent { title: string; constructor(private dependency: DependencyService) { this.title = this.dependency.getData(); } } Here we pass an instance of the dependency in the constructor. With our testing module correctly set up, with our stub, we can now write a test that looks like this: it(`should have as title 'stub'`, async(() => { const fixture = TestBed.createComponent(AppComponent); const app = fixture.debugElement.componentInstance; expect(app.title).toEqual('stub'); })); The test looks normal, but at the point when the dependency would be called in the component code, our stub takes its place and responds instead. Our dependency should be overridden, and as you can see, the expect(app.title).toEqual('stub') assumes the stub will answer, which it does. Spying on the dependency method The previously-mentioned approach, using a stub, is not the only way to isolate ourselves in a unit test. We don't have to replace the entire dependency, only the parts that our component is using. Replacing certain parts means that we point out specific methods on the dependency and assign a spy to them. 
A spy is an interesting construct; it has the ability to answer what you want it to answer, but you can also see how many times it is being called and with what argument/s, so a spy gives you a lot more information about what is going on. Let's have a look at how we would set a spy up: beforeEach(() => { TestBed.configureTestingModule({ declarations: [ExampleComponent], providers: [DependencyService] }); dependency = TestBed.get(DependencyService); spy = spyOn( dependency,'getData'); fixture = TestBed.createComponent(ExampleComponent); }) Now as you can see, the actual dependency is injected into the component. After that, we grab a reference to the component, our fixture variable. This is followed by us using the TestBed.get('Dependency') to get hold of the dependency inside of the component. At this point, we attach a spy to its getData() method through the spyOn( dependency,'getData') call. This is not enough, however; we have yet to instruct the spy what to respond with when being called. Let us do just that: spyOn(dependency,'getData').and.returnValue('spy value'); We can now write our test as usual: it('test our spy dependency', () => { var component = fixture.debugElement.componentInstance; expect(component.title).toBe('spy value'); }); This works as expected, and our spy responds as it should. Remember how we said that spies were capable of more than just responding with a value, that you could also check whether they were invoked and with what? To showcase this, we need to improve our tests a little bit and check for this extended functionality, like so: it('test our spy dependency', () => { var component = fixture.debugElement.componentInstance; expect(spy.calls.any()).toBeTruthy(); }) You can also check for the number of times it was called, with spy.callCount, or whether it was called with some specific arguments: spy.mostRecentCalls.args or spy.toHaveBeenCalledWith('arg1', 'arg2'). Remember if you use a spy, make sure it pays for itself by you needing to do checks like these; otherwise, you might as well use a stub. Spies are a feature of the Jasmine framework, not Angular. The interested reader is urged to research this topic further at http://tobyho.com/2011/12/15/jasmine-spy-cheatsheet/. Async services Very few services are nice and well-behaved, in the sense that they are synchronous. A lot of the time, your service will be asynchronous and the return from it is most likely an observable or a promise. If you are using RxJS with the Http service or HttpClient, it will be observable, but if using the fetch API, it will be a promise. These are two good options for dealing with HTTP, but the Angular team added the RxJS library to Angular to make your life as a developer easier. Ultimately it's up to you, but we recommend going with RxJS. Angular has two constructs ready to tackle the asynchronous scenario when testing: async() and whenStable(): This code ensures that any promises are immediately resolved; it can look more synchronous though fakeAsync() and tick(): This code does what the async does but it looks more synchronous when used Let's describe the async() and whenStable() approaches. Our service has now grown up and is doing something asynchronous when we call it like a timeout or an HTTP call. Regardless of which, the answer doesn't reach us straightaway. By using async() in combination with whenStable(), we can, however, ensure that any promises are immediately resolved. 
Imagine our service now looks like this: export class AsyncDependencyService { getData(): Promise<string> { return new Promise((resolve, reject) => { setTimeout(() => { resolve('data') }, 3000); }) } } We need to change our spy setup to return a promise instead of returning a static string, like so: spy = spyOn(dependency,'getData') .and.returnValue(Promise.resolve('spy data')); We do need to change inside of our component, like so: import { Component, OnInit } from '@angular/core'; import { AsyncDependencyService } from "./async.dependency.service"; @Component({ selector: 'async-example', template: ` <div>{{ title }}</div> ` }) export class AsyncExampleComponent { title: string; constructor(private service: AsyncDependencyService) { this.service.getData().then(data => this.title = data); } } At this point, it's time to update our tests. We need to do two more things. We need to tell our test method to use the async() function, like so: it('async test', async() => { // the test body }) We also need to call fixture.whenStable() to make sure that the promise will have had ample time to resolve, like so: import { TestBed } from "@angular/core/testing"; import { AsyncExampleComponent } from "./async.example.component"; import { AsyncDependencyService } from "./async.dependency.service"; describe('test an component with an async service', () => { let fixture; beforeEach(() => { TestBed.configureTestingModule({ declarations: [AsyncExampleComponent], providers: [AsyncDependencyService] }); fixture = TestBed.createComponent(AsyncExampleComponent); }); it('should contain async data', async () => { const component = fixture.componentInstance; fixture.whenStable.then(() => { fixture.detectChanges(); expect(component.title).toBe('async data'); }); }); }); This version of doing it works as it should, but feels a bit clunky. There is another approach using fakeAsync() and tick(). Essentially, fakeAsync() replaces the async() call and we get rid of whenStable(). The big benefit, however, is that we no longer need to place our assertion statements inside of the promise's then() callback. This gives us synchronous-looking code. Back to fakeAsync(), we need to make a call to tick(), which can only be called within a fakeAsync() call, like so: it('async test', fakeAsync() => { let component = fixture.componentInstance; fixture.detectChanges(); fixture.tick(); expect(component.title).toBe('spy data'); }); As you can see, this looks a lot cleaner; which version you want to use for async testing is up to you. Testing pipes A pipe is basically a class that implements the PipeTransform interface, thus exposing a transform() method that is usually synchronous. Pipes are therefore very easy to test. We will begin by testing a simple pipe, creating, as we mentioned, a test spec right next to its code unit file. The code is as follows: import { Pipe, PipeTransform } from '@angular/core'; @Pipe({ name: 'formattedpipe' }) export class FormattedPipe implements PipeTransform { transform(value: any, ...args: any[]): any { return "banana" + value; } } Our code is very simple; we take a value and add banana to it. Writing a test for it is equally simple. 
The only thing we need to do is to import the pipe and verify two things: that it exposes a transform() method, and that it produces the expected result. The following code writes a test for each of those points:
import { FormattedTimePipe } from './formatted-time.pipe';
import { TestBed } from '@angular/core/testing';

describe('A formatted time pipe', () => {
  let fixture;
  beforeEach(() => {
    fixture = new FormattedTimePipe();
  });
  // Specs with assertions
  it('should expose a transform() method', () => {
    expect(typeof fixture.transform).toEqual('function');
  });
  it('should produce expected result', () => {
    expect(fixture.transform('val')).toBe('bananaval');
  });
});
In our beforeEach() method, we set up the fixture by instantiating the pipe class. In the first test, we ensure that the transform() method exists. This is followed by our second test, which asserts that the transform() method produces the expected result. We saw how to code powerful tests for our components and pipes. If you found this post useful, be sure to check out the book Learning Angular (Second Edition) to learn about mocking HTTP responses and unit testing for routes, input, and output, directives, etc. Getting started with Angular CLI and build your first Angular Component Building Components Using Angular Why switch to Angular for web development – Interview with Minko Gechev
Build botnet detectors using machine learning algorithms in Python [Tutorial]

Melisha Dsouza
26 Aug 2018
12 min read
Botnets are connected computers that perform a number of repetitive tasks to keep websites going. Connected devices play an important role in modern life. From smart home appliances, computers, coffee machines, and cameras, to connected cars, this huge shift in our lifestyles has made our lives easier. Unfortunately, these exposed devices could be easily targeted by attackers and cybercriminals who could use them later to enable larger-scale attacks. Security vendors provide many solutions and products to defend against botnets, but in this tutorial, we are going to learn how to build novel botnet detection systems with Python and machine learning techniques. You will find all the code discussed, in addition to some other useful scripts, in the following repository: https://github.com/PacktPublishing/Mastering-Machine-Learning-for-Penetration-Testing/tree/master/Chapter05 This article is an excerpt from a book written by Chiheb Chebbi titled Mastering Machine Learning for Penetration Testing We are going to learn how to build different botnet detection systems with many machine learning algorithms. As a start to a first practical lab, let's start by building a machine learning-based botnet detector using different classifiers. By now, I hope you have acquired a clear understanding about the major steps of building machine learning systems. So, I believe that you already know that, as a first step, we need to look for a dataset. Many educational institutions and organizations are given a set of collected datasets from internal laboratories. One of the most well known botnet datasets is called the CTU-13 dataset. It is a labeled dataset with botnet, normal, and background traffic delivered by CTU University, Czech Republic. During their work, they tried to capture real botnet traffic mixed with normal traffic and background traffic. To download the dataset and check out more information about it, you can visit the following link: https://mcfp.weebly.com/the-ctu-13-dataset-a-labeled-dataset-with-botnet-normal-and-background-traffic.html. The dataset is bidirectional NetFlow files. But what are bidirectional NetFlow files? Netflow is an internet protocol developed by Cisco. The goal of this protocol is to collect IP traffic information and monitor network traffic in order to have a clearer view about the network traffic flow. The main components of a NetFlow architecture are a NetFlow Exporter, a Netflow collector, and a Flow Storage. The following diagram illustrates the different components of a NetFlow infrastructure: When it comes to NetFlow generally, when host A sends an information to host B and from host B to host A as a reply, the operation is named unidirectional NetFlow. The sending and the reply are considered different operations. In bidirectional NetFlow, we consider the flows from host A and host B as one flow. Let's download the dataset by using the following command: $ wget --no-check-certificate https://mcfp.felk.cvut.cz/publicDatasets/CTU-13-Dataset/CTU-13-Dataset.tar.bz2 Extract the downloaded tar.bz2 file by using the following command: # tar xvjf CTU-13-Dataset.tar.bz2 The file contains all the datasets, with the different scenarios. For the demonstration, we are going to use dataset 8 (scenario 8). 
You can select any scenario or you can use your own collected data, or any other .binetflow files delivered by other institutions: Load the data using pandas as usual: >>> import pandas as pd >>> data = pd.read_csv("capture20110816-3.binetflow") >>> data['Label'] = data.Label.str.contains("Botnet") Exploring the data is essential in any data-centric project. For example, you can start by checking the names of the features or the columns: >> data.columns The command results in the columns of the dataset: StartTime, Dur, Proto, SrcAddr, Sport, Dir, DstAddr, Dport, State, sTos, dTos, TotPkts, TotBytes, SrcBytes, and Label. The columns represent the features used in the dataset; for example, Dur represents duration, Sport represents the source port, and so on. You can find the full list of features in the chapter's GitHub repository. Before training the model, we need to build some scripts to prepare the data. This time, we are going to build a separate Python script to prepare data, and later we can just import it into the main script. I will call the first script DataPreparation.py. There are many proposals done to help extract the features and prepare data to build botnet detectors using machine learning. In our case, I customized two new scripts inspired by the data loading scripts built by NagabhushanS: from __future__ import division import os, sys import threading After importing the required Python packages, we created a class called Prepare to select training and testing data: class Prepare(threading.Thread): def __init__(self, X, Y, XT, YT, accLabel=None): threading.Thread.__init__(self) self.X = X self.Y = Y self.XT=XT self.YT=YT self.accLabel= accLabel def run(self): X = np.zeros(self.X.shape) Y = np.zeros(self.Y.shape) XT = np.zeros(self.XT.shape) YT = np.zeros(self.YT.shape) np.copyto(X, self.X) np.copyto(Y, self.Y) np.copyto(XT, self.XT) np.copyto(YT, self.YT) for i in range(9): X[:, i] = (X[:, i] - X[:, i].mean()) / (X[:, i].std()) for i in range(9): XT[:, i] = (XT[:, i] - XT[:, i].mean()) / (XT[:, i].std()) The second script is called LoadData.py. You can find it on GitHub and use it directly in your projects to load data from .binetflow files and generate a pickle file. Let's use what we developed previously to train the models. After building the data loader and preparing the machine learning algorithms that we are going to use, it is time to train and test the models. First, load the data from the pickle file, which is why we need to import the pickle Python library. Don't forget to import the previous scripts using: import LoadData import DataPreparation import pickle file = open('flowdata.pickle', 'rb') data = pickle.load(file) Select the data sections: Xdata = data[0] Ydata = data[1] XdataT = data[2] YdataT = data[3] As machine learning classifiers, we are going to try many different algorithms so later we can select the best algorithm for our model. Import the required modules to use four machine learning algorithms from sklearn: from sklearn.linear_model import * from sklearn.tree import * from sklearn.naive_bayes import * from sklearn.neighbors import * Prepare the data by using the previous module build. Don't forget to import DataPreparation by typing import DataPreparation: >>> DataPreparation.Prepare(Xdata,Ydata,XdataT,YdataT) Now, we can train the models; and to do that, we are going to train the model with different techniques so later we can select the most suitable machine learning technique for our project. 
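One compact way to run that comparison (a sketch, not the chapter's code) is a small loop over the candidate classifiers, reusing the Xdata, Ydata, XdataT, and YdataT arrays prepared above; the individual models are then walked through one by one below:
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier

candidates = {
    "Decision Tree": DecisionTreeClassifier(),
    "Logistic Regression": LogisticRegression(C=10000),
    "Gaussian Naive Bayes": GaussianNB(),
    "K-Nearest Neighbours": KNeighborsClassifier(),
}
for name, clf in candidates.items():
    clf.fit(Xdata, Ydata)                       # train on the training split
    score = clf.score(XdataT, YdataT) * 100     # accuracy on the held-out split
    print("The score of the %s classifier is %.1f" % (name, score))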
The steps are similar to what we learned in previous projects: after preparing the data and selecting the features, define the machine learning algorithm, fit the model, and print out its score. As machine learning classifiers, we are going to test several of them. Let's start with a decision tree:

Decision tree model:

>>> clf = DecisionTreeClassifier()
>>> clf.fit(Xdata,Ydata)
>>> Prediction = clf.predict(XdataT)
>>> Score = clf.score(XdataT,YdataT)
>>> print("The Score of the Decision Tree Classifier is", Score * 100)

The score of the decision tree classifier is 99%.

Logistic regression model:

>>> clf = LogisticRegression(C=10000)
>>> clf.fit(Xdata,Ydata)
>>> Prediction = clf.predict(XdataT)
>>> Score = clf.score(XdataT,YdataT)
>>> print("The Score of the Logistic Regression Classifier is", Score * 100)

The score of the logistic regression classifier is 96%.

Gaussian Naive Bayes model:

>>> clf = GaussianNB()
>>> clf.fit(Xdata,Ydata)
>>> Prediction = clf.predict(XdataT)
>>> Score = clf.score(XdataT,YdataT)
>>> print("The Score of the Gaussian Naive Bayes classifier is", Score * 100)

The score of the Gaussian Naive Bayes classifier is 72%.

k-Nearest Neighbors model:

>>> clf = KNeighborsClassifier()
>>> clf.fit(Xdata,Ydata)
>>> Prediction = clf.predict(XdataT)
>>> Score = clf.score(XdataT,YdataT)
>>> print("The Score of the K-Nearest Neighbours classifier is", Score * 100)

The score of the k-Nearest Neighbors classifier is 96%.

Neural network model:

To build a neural network model, use the following code:

>>> from keras.models import *
>>> from keras.layers import Dense, Activation
>>> from keras.optimizers import *

model = Sequential()
model.add(Dense(10, input_dim=9, activation="sigmoid"))
model.add(Dense(10, activation='sigmoid'))
model.add(Dense(1))
sgd = SGD(lr=0.01, decay=0.000001, momentum=0.9, nesterov=True)
model.compile(optimizer=sgd, loss='mse')
model.fit(Xdata, Ydata, epochs=200, batch_size=100)
Score = model.evaluate(XdataT, YdataT, verbose=0)
print("The Score of the Neural Network is", Score * 100)

With this code, we imported the required Keras modules, built the layers, compiled the model with an SGD optimizer, fit the model, and printed out its score.

How to build a Twitter bot detector

In the previous sections, we saw how to build a machine learning-based botnet detector. In this new project, we are going to deal with a different problem: instead of defending against botnet malware, we are going to detect Twitter bots, because they are also dangerous and can perform malicious actions. For the model, we are going to use the NYU Tandon Spring 2017 Machine Learning Competition: Twitter Bot classification dataset. You can download it from this link: https://www.kaggle.com/c/twitter-bot-classification/data.

Import the required Python packages:

>>> import pandas as pd
>>> import numpy as np
>>> import seaborn

Let's load the data using pandas and separate the bot and non-bot data:

>>> data = pd.read_csv('training_data_2_csv_UTF.csv')
>>> Bots = data[data.bot==1]
>>> NonBots = data[data.bot==0]

Visualization with seaborn

In every project, I want to help you discover new Python data visualization libraries because, as you saw, data engineering and visualization are essential to every modern data-centric project. This time, I chose seaborn to visualize the data and explore it before starting the training phase. Seaborn is a Python library for making statistical visualizations.
The following is an example of generating a plot with seaborn (note that we use a new variable name here so that we don't overwrite the Twitter data loaded earlier):

>>> demo = np.random.multivariate_normal([0, 0], [[5, 2], [2, 2]], size=2000)
>>> demo = pd.DataFrame(demo, columns=['x', 'y'])
>>> for col in 'xy':
...     seaborn.kdeplot(demo[col], shade=True)

For example, in our case, if we want to identify the missing data:

import matplotlib.pyplot
matplotlib.pyplot.figure(figsize=(10,6))
seaborn.heatmap(data.isnull(), yticklabels=False, cbar=False, cmap='viridis')
matplotlib.pyplot.tight_layout()

The previous two code snippets were just examples of how to visualize data. Visualization helps data scientists to explore and learn more about the data. Now, let's go back and continue building our model.

Identify the bag of words by selecting some bad words used by Twitter bots. The following is an example of bad words used by a bot; of course, you can add more words. Note the trailing | at the end of each continued line, so that the concatenated pieces remain separate alternatives in the regular expression:

bag_of_words_bot = r'bot|b0t|cannabis|tweet me|mishear|follow me|updates every|gorilla|yes_ofc|forget|' \
    r'expos|kill|bbb|truthe|fake|anony|free|virus|funky|RNA|jargon|' \
    r'nerd|swag|jack|chick|prison|paper|pokem|xx|freak|ffd|dunia|clone|genie|bbb|' \
    r'ffd|onlyman|emoji|joke|troll|droop|free|every|wow|cheese|yeah|bio|magic|wizard|face'

Now, it is time to identify the training features:

data['screen_name_binary'] = data.screen_name.str.contains(bag_of_words_bot, case=False, na=False)
data['name_binary'] = data.name.str.contains(bag_of_words_bot, case=False, na=False)
data['description_binary'] = data.description.str.contains(bag_of_words_bot, case=False, na=False)
data['status_binary'] = data.status.str.contains(bag_of_words_bot, case=False, na=False)

Feature extraction: Let's select the features to use in our model:

data['listed_count_binary'] = (data.listed_count>20000)==False
features = ['screen_name_binary', 'name_binary', 'description_binary', 'status_binary', 'verified', 'followers_count', 'friends_count', 'statuses_count', 'listed_count_binary', 'bot']

Now, train the model with a decision tree classifier. We import some previously discussed modules:

from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score, roc_curve, auc
from sklearn.model_selection import train_test_split

We select the features and the target:

X = data[features].iloc[:,:-1]
y = data[features].iloc[:,-1]

We define the classifier:

clf = DecisionTreeClassifier(criterion='entropy', min_samples_leaf=50, min_samples_split=10)

We split the data into training and test sets:

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=101)

We fit the model:

clf.fit(X_train, y_train)
y_pred_train = clf.predict(X_train)
y_pred_test = clf.predict(X_test)

We print out the accuracy scores:

print("Training Accuracy: %.5f" %accuracy_score(y_train, y_pred_train))
print("Test Accuracy: %.5f" %accuracy_score(y_test, y_pred_test))

Our model detects Twitter bots with an 88% detection rate, which is a good accuracy rate. This technique is not the only possible way to detect bots. Researchers have proposed many other models based on different machine learning algorithms, such as linear SVMs and decision trees, with accuracies of around 90%. Most studies showed that feature engineering was a key contributor to improving machine learning models. To study a real-world case, check out the paper What we learn from learning - Understanding capabilities and limitations of machine learning in botnet attacks (https://arxiv.org/pdf/1805.01333.pdf) by David Santana, Shan Suthaharan, and Somya Mohanty.
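The imports above also pull in roc_curve and auc from sklearn.metrics, although the snippet never uses them. As a hedged illustration (this sketch is not part of the original excerpt), here is one way you might plot a ROC curve for the same clf decision tree, assuming the X_test and y_test variables defined above are still in scope:

import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve, auc

# Probability of the positive ("bot") class for each test sample
y_scores = clf.predict_proba(X_test)[:, 1]

# False positive rate, true positive rate, and area under the curve
fpr, tpr, thresholds = roc_curve(y_test, y_scores)
roc_auc = auc(fpr, tpr)

plt.plot(fpr, tpr, label="Decision tree (AUC = %.3f)" % roc_auc)
plt.plot([0, 1], [0, 1], linestyle="--", label="Chance")
plt.xlabel("False positive rate")
plt.ylabel("True positive rate")
plt.legend()
plt.show()

The area under the curve gives a threshold-independent view of the classifier, which complements the single accuracy figures printed above.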
Summary

In this tutorial, we learned how to build a botnet detector and a Twitter bot detector with different machine learning algorithms. To become a master at penetration testing using machine learning with Python, check out the book Mastering Machine Learning for Penetration Testing.

Cisco and Huawei Routers hacked via backdoor attacks and botnets
How to protect yourself from a botnet attack
Tackle trolls with Machine Learning bots: Filtering out inappropriate content just got easy
Understanding Amazon Machine Learning Workflow [ Tutorial ]

Natasha Mathur
24 Aug 2018
11 min read
This article presents an overview of the workflow of a simple Amazon Machine Learning (Amazon ML) project. Amazon Machine Learning is an online service by Amazon Web Services (AWS) that does supervised learning for predictive analytics. Launched in April 2015 at the AWS Summit, Amazon ML joins a growing list of cloud-based machine learning services, such as Microsoft Azure, Google prediction, IBM Watson, Prediction IO, BigML, and many others. These online machine learning services form an offer commonly referred to as Machine Learning as a Service or MLaaS following a similar denomination pattern of other cloud-based services such as SaaS, PaaS, and IaaS respectively for Software, Platform, or Infrastructure as a Service. The Amazon ML workflow closely follows a standard Data Science workflow with steps: Extract the data and clean it up. Make it available to the algorithm. Split the data into a training and validation set, typically a 70/30 split with equal distribution of the predictors in each part. Select the best model by training several models on the training dataset and comparing their performances on the validation dataset. Use the best model for predictions on new data. This article is an excerpt taken from the book 'Effective Amazon Machine Learning' written by Alexis Perrier. As shown in the following Amazon ML menu, the service is built around four objects: Datasource ML model Evaluation Prediction The Datasource and Model can also be configured and set up in the same flow by creating a new Datasource and ML model. We will take a closer look at the Datasource and ML model. Amazon ML  dataset For the rest of the article, we will use the simple Predicting Weight by Height and Age dataset (from Lewis Taylor (1967)) with 237 samples of children's age, weight, height, and gender, which is available at https://v8doc.sas.com/sashtml/stat/chap55/sect51.htm. This dataset is composed of 237 rows. Each row has the following predictors: sex (F, M), age (in months), height (in inches), and we are trying to predict the weight (in lbs) of these children. There are no missing values and no outliers. The variables are close enough in range and normalization is not required. In short, we do not need to carry out any preprocessing or cleaning on the original dataset. Age, height, and weight are numerical variables (real-valued), and sex is a categorical variable. We will randomly select 20% of the rows as the held-out subset to use for the prediction of previously unseen data and keep the other 80% as training and evaluation data. This data split can be done in Excel or any other spreadsheet editor: By creating a new column with randomly generated numbers Sorting the spreadsheet by that column Selecting 190 rows for training and 47 rows for prediction (roughly a 80/20 split) Let us name the training set LT67_training.csv and the held-out set that we will use for prediction LT67_heldout.csv, where LT67 stands for Lewis and Taylor, the creator of this dataset in 1967. Note that it is important for the distribution in age, sex, height, and weight to be similar in both subsets. We want the data on which we will make predictions to show patterns that are similar to the data on which we will train and optimize our model. Loading the data on Amazon S3 Follow these steps to load the training and held-out datasets on S3: Go to your s3 console at https://console.aws.amazon.com/s3. Create a bucket if you haven't done so already. Buckets are basically folders that are uniquely named across all S3. 
We created a bucket named aml.packt. Since that name has now been taken, you will have to choose another bucket name if you are following along with this demonstration. Click on the bucket name you created and upload both the LT67_training.csv and LT67_heldout.csv files by selecting Upload from the Actions drop-down menu:

Both files are small, only a few KB, and hosting costs should remain negligible for this exercise. Note that for each file, by selecting the Properties tab on the right, you can specify how your files are accessed: which user, role, group, or AWS service may download, read, write, and delete the files, and whether or not they should be accessible from the open web. When creating the datasource in Amazon ML, you will be prompted to grant Amazon ML access to your input data. You can specify the access rules to these files now in S3 or simply grant access later on.

Our data is now in the cloud, in an S3 bucket. We need to tell Amazon ML where to find that input data by creating a datasource. We will first create the datasource for the training file LT67_training.csv.

Declaring a datasource

Go to the Amazon ML dashboard, and click on Create new... | Datasource and ML model. We will use the faster flow available by default:

As shown in the following screenshot, you are asked to specify the path to the LT67_training.csv file {S3://bucket}{path}{file}. Note that the S3 location field automatically populates with the bucket names and file names that are available to your user:

The Datasource name is used to organize your Amazon ML assets. By clicking on Verify, Amazon ML will make sure that it has the proper rights to access the file. In case it needs to be granted access to the file, you will be prompted to do so, as shown in the following screenshot:

Just click on Yes to grant access. At this point, Amazon ML will validate the datasource and analyze its contents.

Creating the datasource

An Amazon ML datasource is composed of the following:

The location of the data file: the data file is not duplicated or cloned in Amazon ML but accessed from S3
The schema, which contains information on the type of the variables contained in the CSV file: Categorical, Text, Numeric (real-valued), or Binary

It is possible to supply Amazon ML with your own schema or modify the one created by Amazon ML. At this point, Amazon ML has a pretty good idea of the type of data in your training dataset. It has identified the different types of variables and knows how many rows it has:

Move on to the next step by clicking on Continue, and see what schema Amazon ML has inferred from the dataset, as shown in the next screenshot:

Amazon ML needs to know at this point which variable you are trying to predict. Be sure to tell Amazon ML the following:

The first line in the CSV file contains the column names
The target is the weight

We see here that Amazon ML has correctly inferred the following:

sex is categorical
age, height, and weight are numeric (continuous real values)

Since we chose a numeric variable as the target, Amazon ML will use Linear Regression as the predictive model. For binary or categorical values, it would have used Logistic Regression. This means that Amazon ML will try to find the best a, b, and c coefficients so that the weight predicted by the following equation is as close as possible to the observed real weight present in the data:

predicted weight = a * age + b * height + c * sex

Amazon ML will then ask you if your data contains a row identifier. In our present case, it does not.
Row identifiers are used when you want to understand the prediction obtained for each row or add an extra column to your dataset later on in your project. Row identifiers are for reference purposes only and are not used by the service to build the model. You will be asked to review the datasource. You can go back to each one of the previous steps and edit the parameters for the schema, the target, and the input data. Now that the data is known to Amazon ML, the next step is to set up the parameters of the algorithm that will train the model.

The machine learning model

We select the default parameters for the training and evaluation settings. Amazon ML will do the following:

Create a step for data transformation based on the statistical properties it has inferred from the dataset
Split the dataset (LT67_training.csv) into a training part and a validation part, with a 70/30 split. The split strategy assumes the data has already been shuffled and can be split sequentially.

The transformation step will be used to transform the data in a similar way for the training and the validation datasets. The only transformation suggested by Amazon ML is to transform the categorical variable sex into a binary variable, where m = 0 and f = 1, for instance. No other transformation is needed. The default advanced settings for the model are shown in the following screenshot:

We see that Amazon ML will pass over the data 10 times, shuffling the data each time it splits it. It will use an L2 regularization strategy, based on the sum of the squares of the regression coefficients, to prevent overfitting. We will evaluate the predictive power of the model using our LT67_heldout.csv dataset later on. Regularization comes in three levels, with a mild (10^-6), medium (10^-4), or aggressive (10^-2) setting, each value stronger than the previous one. The default setting is mild, the lowest, with a regularization constant of 0.000001 (10^-6), implying that Amazon ML does not anticipate much overfitting on this dataset. This makes sense when the number of predictors, three in our case, is much smaller than the number of samples (190 for the training set).

Clicking on the Create ML model button will launch the model creation. This takes a few minutes to resolve, depending on the size and complexity of your dataset. You can check its status by refreshing the model page. In the meantime, the model status remains pending. At that point, Amazon ML will split our training dataset into two subsets: a training set and a validation set. It will use the training portion of the data to train several settings of the algorithm and select the best one based on its performance on the training data. It will then apply the associated model to the validation set and return an evaluation score for that model. By default, Amazon ML will sequentially take the first 70% of the samples for training and the remaining 30% for validation. It's worth noting that Amazon ML will not create two extra files and store them on S3; instead, it creates two new datasources out of the initial datasource we previously defined. Each new datasource is obtained from the original one via a data rearrangement JSON recipe such as the following:

{
  "splitting": {
    "percentBegin": 0,
    "percentEnd": 70
  }
}

You can see these two new datasources in the Datasource dashboard.
Three datasources are now available where there was initially only one, as shown in the following screenshot:

While the model is being trained, Amazon ML runs the Stochastic Gradient Descent algorithm several times on the training data with different parameters:

Varying the learning rate in increments of powers of 10: 0.01, 0.1, 1, 10, and 100.
Making several passes over the training data while shuffling the samples before each pass.
At each pass, calculating the prediction error, the Root Mean Squared Error (RMSE), to estimate how much of an improvement over the last pass was obtained. If the decrease in RMSE is not significant, the algorithm is considered to have converged, and no further passes are made.

At the end of the passes, the setting that ends up with the lowest RMSE wins, and the associated model (the weights of the regression) is selected as the best version. Once the model has finished training, Amazon ML evaluates its performance on the validation datasource. Once the evaluation itself is ready, you have access to the model's evaluation.

The Amazon ML flow is smooth and facilitates the inherent data science loop: data, model, evaluation, and prediction. We looked at an overview of the workflow of a simple Amazon Machine Learning (Amazon ML) project, and we discussed two objects of the Amazon ML menu: Datasource and ML model. If you found this post useful, be sure to check out the book 'Effective Amazon Machine Learning' to learn about evaluation and prediction in Amazon ML, along with other AWS ML concepts.

Integrate applications with AWS services: Amazon DynamoDB & Amazon Kinesis [Tutorial]
AWS makes Amazon Rekognition, its image recognition AI, available for Asia-Pacific developers
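Since every step of this workflow happens in the AWS console, it can help to see the same ideas expressed in code. The following is a minimal local sketch, not taken from the original article: it reproduces the 80/20 held-out split, the 70/30 training/validation split, and an SGD-trained linear regression evaluated with RMSE using scikit-learn. The file name LT67.csv and the column names are assumptions based on the dataset description above.

import numpy as np
import pandas as pd
from sklearn.linear_model import SGDRegressor
from sklearn.metrics import mean_squared_error

df = pd.read_csv("LT67.csv")  # assumed columns: sex, age, height, weight
df["sex"] = (df["sex"].str.upper() == "F").astype(int)  # binary encoding, as Amazon ML suggests

# 80/20 split between training/evaluation data and the held-out prediction set
shuffled = df.sample(frac=1.0, random_state=42)
train_eval, held_out = shuffled.iloc[:190], shuffled.iloc[190:]

# 70/30 sequential split into training and validation, mirroring the default data rearrangement recipe
split = int(len(train_eval) * 0.7)
train, validation = train_eval.iloc[:split], train_eval.iloc[split:]

features = ["sex", "age", "height"]
# In practice, you would also scale age and height before using SGD
model = SGDRegressor(max_iter=1000, tol=1e-3, penalty="l2", alpha=1e-6, random_state=42)
model.fit(train[features], train["weight"])

predictions = model.predict(validation[features])
rmse = np.sqrt(mean_squared_error(validation["weight"], predictions))
print("Validation RMSE:", rmse)

This is only a rough stand-in for what the service does internally, but it makes the split percentages, the L2 regularization constant, and the RMSE evaluation tangible.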
Setting up Jasmine for Unit Testing in Angular [Tutorial]

Natasha Mathur
24 Aug 2018
8 min read
Web developers work hard to build a working application that they can be proud of. But how can we ensure a painless maintainability in the future? A comprehensive automated testing layer will become our lifeline once our application begins to scale up and we have to mitigate the impact of bugs caused by new functionalities colliding with the already existing ones. In this article, we will see why we need to perform unit testing and the process of setting up Jasmine for our tests. This tutorial is an excerpt taken from the book 'Learning Angular - second edition' written by Christoffer Noring, Pablo Deeleman.  Why do we need unit tests? What is a unit test? Unit tests are part of an engineering philosophy that takes a stand for efficient and agile development processes, by adding an additional layer of automated testing to the code, before it is developed. The core concept is that each piece of code is delivered with its own test, and both pieces of code are built by the developer who is working on that code. First, we design the test against the module we want to deliver, checking the accuracy of its output and behavior. Since the module is still not implemented, the test will fail. Hence, our job is to build the module in such a way that it passes its own test. Unit testing is quite controversial. While there is a common agreement about how beneficial test-driven development for ensuring code quality and maintenance over time is, not everybody undertakes unit testing in their daily practice. Why is that? Well, building tests while we develop our code can feel like a burden sometimes, particularly when the test winds up being bigger in size than the piece of functionality it aims to test. However, the arguments favoring testing outnumber the arguments against it: Building tests contribute to better code design. Our code must conform to the test requirements and not the other way around. In that sense, if we try to test an existing piece of code and we find ourselves blocked at some point, chances are that the piece of code we aim to test is not well designed and shows off a convoluted interface that requires some rethinking. On the other hand, building testable modules can help with early detection of side effects on other modules. Refactoring tested code is the lifeline against introducing bugs in later stages. Any development is meant to evolve with time, and on every refactor the risk of introducing a bug, that will only pop up in another part of our application, is high. Unit tests are a good way to ensure that we catch bugs at an early stage, either when introducing new features or when updating existing ones. Building tests is a good way to document our code APIs and functionalities. And this becomes a priceless resource when someone not acquainted with the code base takes over the development endeavor. These are only a few arguments, but you can find countless resources on the web about the benefits of testing your code. If you do not feel convinced yet, give it a try. Otherwise, let's continue with our journey and see the overall form of a test. Unit testing in Jasmine There are many different ways to test a piece of code, but in this article, we will look at the anatomy of a test, what it is made up of. The first thing we need, for testing any code, is a test framework. The test framework should provide utility functions for building test suites, containing one or several test specs each. So what are these concepts? 
Test suite: A suite creates a logical grouping for a bunch of tests. A suite can, for example, contain all the tests for a product page.
Test spec: This is another name for a unit test.

The following shows what a test file can look like when we use a test suite and place a number of related tests inside it. The chosen framework for this is Jasmine. In Jasmine, the describe() function helps us define a test suite. The describe() method takes a name as the first parameter and a function as the second parameter. Inside the describe() function are a number of invocations of the it() method. The it() function is our unit test; it takes the name of the test as the first parameter and a function as the second parameter:

// Test suite
describe('A math library', () => {
  // Test spec
  it('add(1,1) should return 2', () => {
    // Test spec implementation goes here
  });
});

Each test spec checks out a specific functionality of the feature described in the suite description argument and declares one or several expectations in its body. Each expectation takes a value, which we call the expected value, and is compared against an actual value by means of a matcher function, which checks whether the expected and actual values match accordingly. This is what we call an assertion, and the test framework will pass or fail the spec depending on the result of such assertions. The code is as follows:

// Test suite
describe('A math library', () => {
  // Test spec
  it('add(1,1) should return 2', () => {
    // Test assertion
    expect(add(1,1)).toBe(2);
  });
  it('subtract(2,1) should return 1', () => {
    // Test assertion
    expect(subtract(2,1)).toBe(1);
  });
});

In the previous example, add(1,1) will return the actual value that is supposed to match the expected value declared in the toBe() matcher function. Worth noting from the previous example is the addition of a second test that tests our subtract() function. We can clearly see that this test deals with yet another mathematical operation, thus it makes sense to group both these tests under one suite.

So far, we have learned about test suites and how to group tests according to their function. Furthermore, we have learned about invoking the code you want to test and asserting that it does what you think it does. There are, however, more concepts to a unit test worth knowing about, namely setup and teardown functionality. A setup functionality is something that sets up your code before the test is run. It's a way to keep your code cleaner so you can focus on just invoking the code and asserting. A teardown functionality is the opposite of a setup functionality and is dedicated to tearing down what you set up initially; essentially, it's a way to clean up after the test. Let's see how this can look in practice with a code example, using the Jasmine framework. In Jasmine, the beforeEach() method is used for setup functionality; it runs before every unit test. The afterEach() method is used to run teardown logic. The code is as follows:

describe('a Product service', () => {
  let productService;
  beforeEach(() => {
    productService = new ProductService();
  });
  it('should return data', () => {
    let actual = productService.getData();
    expect(actual.length).toBe(1);
  });
  afterEach(() => {
    productService = null;
  });
});

We can see in the preceding code how the beforeEach() function is responsible for instantiating productService, which means the test only has to care about invoking the production code and asserting the outcome. This makes the test look cleaner.
It should be said, though, that in reality, tests tend to have a lot of setup going on, and having a beforeEach() function can really make the tests look cleaner; above all, it tends to make it easier to add new tests, which is great. What you want at the end of the day is well-tested code; the easier it is to write and maintain such code, the better for your software. Testing web applications in general, and Angular applications in particular, poses a myriad of scenarios that usually need a specific approach. Remember that if a specific test requires a cumbersome and convoluted solution, we are probably facing a good case for a module redesign instead. There are several paths to compound our knowledge of web application testing in Angular and enable us to become great testing ninjas.

In this article, we saw the importance of unit testing in our Angular applications, the basic shape of a unit test, and the process of setting up Jasmine for our tests. If you found this post useful, do check out the book 'Learning Angular - Second Edition' to learn more about unit testing and how to implement it for routes, inputs, outputs, directives, and more.

Everything new in Angular 6: Angular Elements, CLI commands and more
Angular 6 is here packed with exciting new features!
Getting started with Angular CLI and build your first Angular Component
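To tie the setup and teardown ideas back to Angular itself, the following is a hedged sketch of what a spec can look like when Angular's TestBed wires up dependency injection before each test. The ProductService class and its getData() method are hypothetical, carried over from the example above, and the import path is an assumption:

import { TestBed } from '@angular/core/testing';
import { ProductService } from './product.service'; // hypothetical service from the example above

describe('ProductService (with TestBed)', () => {
  let productService: ProductService;

  beforeEach(() => {
    // Configure a testing module and let the injector create the service
    TestBed.configureTestingModule({
      providers: [ProductService]
    });
    productService = TestBed.get(ProductService); // TestBed.inject() in newer Angular versions
  });

  it('should return data', () => {
    const actual = productService.getData();
    expect(actual.length).toBe(1);
  });
});

The shape is the same as the plain Jasmine example; the only difference is that the setup step delegates the instantiation to Angular's injector instead of calling the constructor directly.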
Using Decorators as higher-order functions in Python [Tutorial]

Aaron Lazar
24 Aug 2018
7 min read
The core idea of a decorator is to transform some original function into another form. A decorator creates a kind of composite function based on the decorator and the original function being decorated. In this tutorial, we'll understand how decorators can be used as higher-order functions in Python. This article is an extract from the 2nd edition of the bestseller, Functional Python Programming, authored by Steven Lott.

Working with decorator functions

A decorator function can be used in one of the two following ways:

As a prefix that creates a new function with the same name as the base function, as follows:

@decorator
def original_function():
    pass

As an explicit operation that returns a new function, possibly with a new name:

def original_function():
    pass
original_function = decorator(original_function)

These are two different syntaxes for the same operation. The prefix notation has the advantages of being tidy and succinct, and the prefix location is more visible to some readers. The suffix notation is explicit and slightly more flexible. While the prefix notation is common, there is one reason for using the suffix notation: we may not want the resulting function to replace the original function. We might want to execute the following command, which allows us to use both the decorated and the undecorated functions:

new_function = decorator(original_function)

This will build a new function, named new_function(), from the original function. Python functions are first-class objects. When using the @decorator syntax, the original function is no longer available for use.

A decorator is a function that accepts a function as an argument and returns a function as the result. This basic description is clearly a built-in feature of the language. The open question then is how do we update or adjust the internal code structure of a function? The answer is we don't. Rather than messing about with the inside of the code, it's much cleaner to define a new function that wraps the original function. It's easier to process the argument values or the result and leave the original function's core processing alone. We have two phases of higher-order functions involved in defining a decorator; they are as follows:

At definition time, a decorator function applies a wrapper to a base function and returns the new, wrapped function. The decoration process can do some one-time-only evaluation as part of building the decorated function. Complex default values can be computed, for example.
At evaluation time, the wrapping function can (and usually does) evaluate the base function. The wrapping function can pre-process the argument values or post-process the return value (or both). It's also possible that the wrapping function may avoid calling the base function. In the case of managing a cache, for example, the primary reason for wrapping is to avoid expensive calls to the base function.

Simple decorator function

Here's an example of a simple decorator:

from functools import wraps
from typing import Callable, Optional, Any, TypeVar, cast

FuncType = Callable[..., Any]
F = TypeVar('F', bound=FuncType)

def nullable(function: F) -> F:
    @wraps(function)
    def null_wrapper(arg: Optional[Any]) -> Optional[Any]:
        return None if arg is None else function(arg)
    return cast(F, null_wrapper)

We almost always want to use the functools.wraps() function to assure that the decorated function retains the attributes of the original function.
Copying the __name__, and __doc__ attributes, for example, assures that the resulting decorated function has the name and docstring of the original function. The resulting composite function, defined as the null_wrapper() function in the definition of the decorator, is also a type of higher-order function that combines the original function, the function() callable object, in an expression that preserves the None values. Within the resulting null_wrapper() function, the original function callable object is not an explicit argument; it is a free variable that will get its value from the context in which the null_wrapper() function is defined. The decorator function's return value is the newly minted function. It will be assigned to the original function's name. It's important that decorators only return functions and that they don't attempt to process data. Decorators use meta-programming: a code that creates a code. The resulting null_wrapper() function, however, will be used to process the real data. Note that the type hints use a feature of a TypeVar to assure that the result of applying the decorator will be a an object that's a type of Callable. The type variable F is bound to the original function's type; the decorator's type hint claims that the resulting function should have the same type as the argument function. A very general decorator will apply to a wide variety of functions, requiring a type variable binding. Creating composite function We can apply our @nullable decorator to create a composite function as follows: @nullable def nlog(x: Optional[float]) -> Optional[float]: return math.log(x) This will create a function, nlog(), which is a null-aware version of the built-in math.log() function. The decoration process returned a version of the null_wrapper() function that invokes the original nlog(). This result will be named nlog(), and will have the composite behavior of the wrapping and the original wrapped function. We can use this composite nlog() function as follows: >>> some_data = [10, 100, None, 50, 60] >>> scaled = map(nlog, some_data) >>> list(scaled) [2.302585092994046, 4.605170185988092, None, 3.912023005428146, 4.0943445622221] We've applied the function to a collection of data values. The None value politely leads to a None result. There was no exception processing involved. This type of example isn't really suitable for unit testing. We'll need to round the values for testing purposes. For this, we'll need a null-aware round() function too. Here's how we can create a null-aware rounding function using decorator notation: @nullable def nround4(x: Optional[float]) -> Optional[float]: return round(x, 4) This function is a partial application of the round() function, wrapped to be null-aware. In some respects, this is a relatively sophisticated bit of functional programming that's readily available to Python programmers. The typing module makes it particularly easy to describe the types of null-aware function and null-aware result, using the Optional type definition. The definition Optional[float] means Union[None, float]; either a None object or a float object may be used. We could also create the null-aware rounding function using the following code: nround4 = nullable(lambda x: round(x, 4)) Note that we didn't use the decorator in front of a function definition. Instead, we applied the decorator to a function defined as a lambda form.  This has the same effect as a decorator in front of a function definition. 
We can use this round4() function to create a better test case for our nlog() function as follows: >>> some_data = [10, 100, None, 50, 60] >>> scaled = map(nlog, some_data) >>> [nround4(v) for v in scaled] [2.3026, 4.6052, None, 3.912, 4.0943] This result will be independent of any platform considerations. It's very handy for doctest testing. It can be challenging to apply type hints to lambda forms. The following code shows what is required: nround4l: Callable[[Optional[float]], Optional[float]] = ( nullable(lambda x: round(x, 4)) ) The variable nround4l is given a type hint of Callable with an argument list of [Optional[float]] and a return type of Optional[float]. The use of the Callable hint is appropriate only for positional arguments. In cases where there will be keyword arguments or other complexities, see http://mypy.readthedocs.io/en/latest/kinds_of_types.html#extended-callable-types. The @nullable decorator makes an assumption that the decorated function is unary. We would need to revisit this design to create a more general-purpose null-aware decorator that works with arbitrary collections of arguments. If you found this tutorial useful and interested to learn more such techniques, grab the Steven Lott's bestseller, Functional Python Programming. Putting the Fun in Functional Python Why functional programming in Python matters: Interview with best selling author, Steven Lott Expert Python Programming: Interfaces
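As a closing illustration, and as my own sketch rather than the book's code, one way to lift the unary restriction mentioned above is to let the wrapper accept arbitrary positional and keyword arguments and short-circuit when any of them is None:

from functools import wraps
from typing import Any, Callable, Optional, TypeVar, cast
import math

F = TypeVar('F', bound=Callable[..., Any])

def nullable_any(function: F) -> F:
    """Return None if any argument is None; otherwise call the wrapped function."""
    @wraps(function)
    def null_wrapper(*args: Any, **kwargs: Any) -> Optional[Any]:
        if any(a is None for a in args) or any(v is None for v in kwargs.values()):
            return None
        return function(*args, **kwargs)
    return cast(F, null_wrapper)

@nullable_any
def nlog_scaled(x: float, base: float = math.e) -> float:
    return math.log(x, base)

print(nlog_scaled(100, 10))   # 2.0
print(nlog_scaled(None, 10))  # None

This keeps the same decorate-and-wrap structure as nullable(), at the cost of a slightly looser type hint for the wrapper.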
Creating Macros in Rust [Tutorial]

Aaron Lazar
23 Aug 2018
14 min read
Since Rust 1.0 has a great macro system, it allows us to apply some code to multiple types or expressions, as they work by expanding themselves at compile time. This means that when you use a macro, you are effectively writing a lot of code before the actual compilation starts. This has two main benefits, first, the codebase can be easier to maintain by being smaller and reusing code. Second, since macros expand before starting the creation of object code, you can abstract at the syntactic level. In this article, we'll learn how to create our very own macros in Rust. This Rust tutorial is an extract from Rust High Performance, authored by Iban Eguia Moraza. For example, you can have a function like this one: fn add_one(input: u32) -> u32 { input + 1 } This function restricts the input to u32 types and the return type to u32. We could add some more accepted types by using generics, which may accept &u32 if we use the Add trait. Macros allow us to create this kind of code for any element that can be written to the left of the + sign and it will be expanded differently for each type of element, creating a different code for each case. To create a macro, you will need to use a macro built into the language, the macro_rules!{} macro. This macro receives the name of the new macro as a first parameter and a block with the macro code as a second element. The syntax can be a bit complex the first time you see it, but it can be learned quickly. Let's start with a macro that does just the same as the function we saw before: macro_rules! add_one { ($input:expr) => { $input + 1 } } You can now call that macro from your main() function by calling add_one!(integer);. Note that the macro needs to be defined before the first call, even if it's in the same file. It will work with any integer, which wasn't possible with functions. Let's analyze how the syntax works. In the block after the name of the new macro (add_one), we can see two sections. In the first part, on the left of the =>, we see $input:expr inside parentheses. Then, to the right, we see a Rust block where we do the actual addition. The left part works similarly (in some ways) to a pattern match. You can add any combination of characters and then some variables, all of them starting with a dollar sign ($) and showing the type of variable after a colon. In this case, the only variable is the $input variable and it's an expression. This means that you can insert any kind of expression there and it will be written in the code to the right, substituting the variable with the expression. Creating Macro variants As you can see, it's not as complicated as you might think. As I wrote, you can have almost any pattern to the left of the macro_rules!{} side. Not only that, you can also have multiple patterns, as if it were a match statement, so that if one of them matches, it will be the one expanded. Let's see how this works by creating a macro which, depending on how we call it, will add one or two to the given integer: macro_rules! add { {one to $input:expr} => ($input + 1); {two to $input:expr} => ($input + 2); } fn main() { println!("Add one: {}", add!(one to 25/5)); println!("Add two: {}", add!(two to 25/5)); } You can see a couple of clear changes to the macro. First, we swapped braces for parentheses and parentheses for braces in the macro. This is because in a macro, you can use interchangeable braces ({ and }), square brackets ([ and ]), and parentheses (( and )). Not only that, you can use them when calling the macro. 
You have probably already used the vec![] macro and the format!() macro, and we saw the lazy_static!{} macro in the last chapter. We use brackets and parentheses here just for convention, but we could call the vec!{} or the format![] macros the same way, because we can use braces, brackets, and parentheses in any macro call. The second change was to add some extra text to our left-hand side patterns. We now call our macro by writing the text one to or two to, so I also removed the one redundancy to the macro name and called it add!(). This means that we now call our macro with literal text. That is not valid Rust, but since we are using a macro, we modify the code we are writing before the compiler tries to understand actual Rust code and the generated code is valid. We could add any text that does not end the pattern (such as parentheses or braces) to the pattern. The final change was to add a second possible pattern. We can now add one or two and the only difference will be that the right side of the macro definition must now end with a trailing semicolon for each pattern (the last one is optional) to separate each of the options. A small detail that I also added in the example was when calling the macro in the main() function. As you can see, I could have added one or two to 5, but I wrote 25/5 for a reason. When compiling this code, this will be expanded to 25/5 + 1 (or 2, if you use the second variant). This will later be optimized at compile time, since it will know that 25/5 + 1 is 6, but the compiler will receive that expression, not the final result. The macro system will not calculate the result of the expression; it will simply copy in the resulting code whatever you give to it and then pass it to the next compiler phase. You should be especially careful with this when a macro you are creating calls another macro. They will get expanded recursively, one inside the other, so the compiler will receive a bunch of final Rust code that will need to be optimized. Issues related to this were found in the CLAP crate that we saw in the last chapter, since the exponential expansions were adding a lot of bloat code to their executables. Once they found out that there were too many macro expansions inside the other macros and fixed it, they reduced the size of their binary contributions by more than 50%. Macros allow for an extra layer of customization. You can repeat arguments more than once. This is common, for example, in the vec![] macro, where you create a new vector with information at compile time. You can write something like vec![3, 4, 76, 87];. How does the vec![] macro handle an unspecified number of arguments? Creating Complex macros We can specify that we want multiple expressions in the left-hand side pattern of the macro definition by adding a * for zero or more matches or a + for one or more matches. Let's see how we can do that with a simplified my_vec![] macro: macro_rules! my_vec { ($($x: expr),*) => {{ let mut vector = Vec::new(); $(vector.push($x);)* vector }} } Let's see what is happening here. First, we see that on the left side, we have two variables, denoted by the two $ signs. The first makes reference to the actual repetition. Each comma-separated expression will generate a $x variable. Then, on the right side, we use the various repetitions to push $x to the vector once for every expression we receive. There is another new thing on the right-hand side. As you can see, the macro expansion starts and ends with a double brace instead of using only one. 
This is because, once the macro gets expanded, it will substitute the given expression with a new expression: the one that gets generated. Since what we want is to return the vector we are creating, we need a new scope where the last expression will be the value of the scope once it gets executed. You will be able to see it more clearly in the next code snippet. We can call this code from the main() function:

fn main() {
    let my_vector = my_vec![4, 8, 15, 16, 23, 42];
    println!("Vector test: {:?}", my_vector);
}

It will be expanded to this code:

fn main() {
    let my_vector = {
        let mut vector = Vec::new();
        vector.push(4);
        vector.push(8);
        vector.push(15);
        vector.push(16);
        vector.push(23);
        vector.push(42);
        vector
    };
    println!("Vector test: {:?}", my_vector);
}

As you can see, we need those extra braces to create the scope that will return the vector so that it gets assigned to the my_vector binding. You can have multiple repetition patterns in the left expression, and they will be repeated for every use, as needed, on the right:

macro_rules! add_to_vec {
    ($( $x:expr; [ $( $y:expr ),* ]);* ) => {
        &[ $($( $x + $y ),*),* ]
    }
}

In this example, the macro can receive one or more $x; [$y1, $y2,...] inputs. So, for each input, it will have one expression, then a semicolon, then a bracket with multiple sub-expressions separated by commas, and finally, a closing bracket and a semicolon separating it from the next group. But what does the macro do with this input? Let's check the right-hand side of it. As you can see, this will create multiple repetitions. We can see that it creates a slice (&[T]) of whatever we feed to it, so all the expressions we use must be of the same type. Then, it will start iterating over all $x variables, one per input group. So, if we feed it only one input, it will iterate once for the expression to the left of the semicolon. Then, it will iterate once for every $y expression associated with the $x expression, add them with the + operator, and include the result in the slice.

If this was too complex to understand, let's look at an example. Let's suppose we call the macro with 65; [22, 34] as input. In this case, 65 will be $x, and 22 and 34 will be the $y variables associated with 65. So, the result will be a slice like this: &[65+22, 65+34]. Or, if we calculate the results: &[87, 99]. If, on the other hand, we give two groups of variables by using 65; [22, 34]; 23; [56, 35] as input, in the first iteration, $x will be 65, while in the second one, it will be 23. The $y variables associated with 65 will be 22 and 34, as before, and the ones associated with 23 will be 56 and 35. This means that the final slice will be &[87, 99, 79, 58], where 87 and 99 work the same way as before and 79 and 58 come from adding 23 to 56 and 23 to 35.

This gives you much more flexibility than functions, but remember, all of this will be expanded at compile time, which can make your compilation much slower and the final codebase larger and slower still if the macro you use duplicates too much code. In any case, there is more flexibility to it yet. So far, all variables have been of the expr kind. We have used this by declaring $x:expr and $y:expr but, as you can imagine, there are other kinds of macro variables. The list follows:

expr: Expressions that you can write after an = sign, such as 76+4 or if a==1 {"something"} else {"other thing"}.
ident: An identifier or binding name, such as foo or bar.
path: A qualified path. This will be a path that you could write in a use sentence, such as foo::bar::MyStruct or foo::bar::my_func.
ty: A type, such as u64 or MyStruct. It can also be a path to the type. pat: A pattern that you can write at the left side of an = sign or in a match expression, such as Some(t) or (a, b, _). stmt: A full statement, such as a let binding like let a = 43;. block: A block element that can have multiple statements and a possible expression between braces, such as {vec.push(33); vec.len()}. item: What Rust calls items. For example, function or type declarations, complete modules, or trait definitions. meta: A meta element, which you can write inside of an attribute (#[]). For example, cfg(feature = "foo"). tt: Any token tree that will eventually get parsed by a macro pattern, which means almost anything. This is useful for creating recursive macros, for example. As you can imagine, some of these kinds of macro variables overlap and some of them are just more specific than the others. The use will be verified on the right-hand side of the macro, in the expansion, since you might try to use a statement where an expression must be used, even though you might use an identifier too, for example. There are some extra rules, too, as we can see in the Rust documentation (https://doc.rust-lang.org/book/first-edition/macros.html#syntactic-requirements). Statements and expressions can only be followed by =>, a comma, or a semicolon. Types and paths can only be followed by =>, the as or where keywords, or any commas, =, |, ;, :, >, [, or {. And finally, patterns can only be followed by =>, the if or in keywords, or any commas, =, or |. Let's put this in practice by implementing a small Mul trait for a currency type we can create. This is an adapted example of some work we did when creating the Fractal Credits digital currency. In this case, we will look to the implementation of the Amount type (https://github.com/FractalGlobal/utils-rs/blob/49955ead9eef2d9373cc9386b90ac02b4d5745b4/src/amount.rs#L99-L102), which represents a currency amount. Let's start with the basic type definition: #[derive(Copy, Clone, PartialEq, Eq, PartialOrd, Ord)] pub struct Amount { value: u64, } This amount will be divisible by up to three decimals, but it will always be an exact value. We should be able to add an Amount to the current Amount, or to subtract it. I will not explain these trivial implementations, but there is one implementation where macros can be of great help. We should be able to multiply the amount by any positive integer, so we should implement the Mul trait for u8, u16, u32, and u64 types. Not only that, we should be able to implement the Div and the Rem traits, but I will leave those out, since they are a little bit more complex. You can check them in the implementation linked earlier. The only thing the multiplication of an Amount with an integer should do is to multiply the value by the integer given. Let's see a simple implementation for u8: use std::ops::Mul; impl Mul<u8> for Amount { type Output = Self; fn mul(self, rhs: u8) -> Self::Output { Self { value: self.value * rhs as u64 } } } impl Mul<Amount> for u8 { type Output = Amount; fn mul(self, rhs: Amount) -> Self::Output { Self::Output { value: self as u64 * rhs.value } } } As you can see, I implemented it both ways so that you can put the Amount to the left and to the right of the multiplication. If we had to do this for all integers, it would be a big waste of time and code. And if we had to modify one of the implementations (especially for Rem functions), it would be troublesome to do it in multiple code points. Let's use macros to help us. 
We can define a macro, impl_mul_int!{}, which will receive a list of integer types and then implement the Mul trait back and forward between all of them and the Amount type. Let's see: macro_rules! impl_mul_int { ($($t:ty)*) => ($( impl Mul<$t> for Amount { type Output = Self; fn mul(self, rhs: $t) -> Self::Output { Self { value: self.value * rhs as u64 } } } impl Mul<Amount> for $t { type Output = Amount; fn mul(self, rhs: Amount) -> Self::Output { Self::Output { value: self as u64 * rhs.value } } } )*) } impl_mul_int! { u8 u16 u32 u64 usize } As you can see, we specifically ask for the given elements to be types and then we implement the trait for all of them. So, for any code that you want to implement for multiple types, you might as well try this approach, since it will save you from writing a lot of code and it will make it more maintainable. If you found this article useful and would like to learn more such tips, head on over to pick up the book, Rust High Performance, authored by Iban Eguia Moraza. Perform Advanced Programming with Rust Rust 1.28 is here with global allocators, nonZero types and more Eclipse IDE's Photon release will support Rust
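To check that the expansions behave as described, a small test module can exercise the macros from this article. The following sketch is mine, not from the book, and assumes that the my_vec! and add_to_vec! macros are defined earlier in the same file:

#[cfg(test)]
mod tests {
    #[test]
    fn my_vec_builds_a_vector() {
        // Expands to a scope that pushes each element and returns the vector
        let v = my_vec![4, 8, 15, 16, 23, 42];
        assert_eq!(v, vec![4, 8, 15, 16, 23, 42]);
    }

    #[test]
    fn add_to_vec_adds_each_pair() {
        // 65+22, 65+34, 23+56, 23+35 => [87, 99, 79, 58]
        let result = add_to_vec!(65; [22, 34]; 23; [56, 35]);
        assert_eq!(result, &[87, 99, 79, 58]);
    }
}

Running cargo test on such a module is a cheap way to make sure a macro keeps expanding to what you expect while you refactor it.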
Train a convolutional neural network in Keras and improve it with data augmentation [Tutorial]

Amey Varangaonkar
23 Aug 2018
10 min read
In this article, we will see how convolutional layers work and how to use them. We will also see how you can build your own convolutional neural network in Keras to build better, more powerful deep neural networks and solve computer vision problems. We will also see how we can improve this network using data augmentation. For a better understanding of the concepts, we will be taking a well-known dataset CIFAR-10. This dataset was created by Alex Krizhevsky, Vinod Nair, and Geoffrey Hinton. The following article has been taken from the book Deep Learning Quick Reference, written by Mike Bernico.  Adding inputs to the network The CIFAR-10 dataset is made up of 60,000 32 x 32 color images that belong to 10 classes, with 6,000 images per class. We'll be using 50,000 images as a training set, 5,000 images as a validation set, and 5,000 images as a test set. The input tensor layer for the convolutional neural network will be (N, 32, 32, 3), which we will pass to the build_network function. The following code is used to build the network: def build_network(num_gpu=1, input_shape=None): inputs = Input(shape=input_shape, name="input") Getting the output The output of this model will be a class prediction, from 0-9. We will use a 10-node softmax.  We will use the following code to define the output: output = Dense(10, activation="softmax", name="softmax")(d2) Cost function and metrics Earlier, we used categorical cross-entropy as the loss function for a multi-class classifier.  This is just another multiclass classifier and we can continue using categorical cross-entropy as our loss function, and accuracy as a metric. We've moved on to using images as input, but luckily our cost function and metrics remain unchanged. Working with convolutional layers We're going to use two convolutional layers, with batch normalization, and max pooling. This is going to require us to make quite a few choices, which of course we could choose to search as hyperparameters later. It's always better to get something working first though. As the popular computer scientist and mathematician Donald Knuth would say, premature optimization is the root of all evil. We will use the following code snippet to define the two convolutional blocks: # convolutional block 1 conv1 = Conv2D(64, kernel_size=(3,3), activation="relu", name="conv_1")(inputs) batch1 = BatchNormalization(name="batch_norm_1")(conv1) pool1 = MaxPooling2D(pool_size=(2, 2), name="pool_1")(batch1) # convolutional block 2 conv2 = Conv2D(32, kernel_size=(3,3), activation="relu", name="conv_2")(pool1) batch2 = BatchNormalization(name="batch_norm_2")(conv2) pool2 = MaxPooling2D(pool_size=(2, 2), name="pool_2")(batch2) So, clearly, we have two convolutional blocks here, that consist of a convolutional layer, a batch normalization layer, and a pooling layer. In the first block, I'm using 64 3 x 3 filters with relu activations. I'm using valid (no) padding and a stride of 1. Batch normalization doesn't require any parameters and it isn't really trainable. The pooling layer is using 2 x 2 pooling windows, valid padding, and a stride of 2 (the dimension of the window). The second block is very much the same; however, I'm halving the number of filters to 32. While there are many knobs we could turn in this architecture, the one I would tune first is the kernel size of the convolutions. Kernel size tends to be an important choice. 
In fact, some modern neural network architectures such as Google's inception, allow us to use multiple filter sizes in the same convolutional layer. Getting the fully connected layers After two rounds of convolution and pooling, our tensors have gotten relatively small and deep. After pool_2, the output dimension is (n, 6, 6, 32). We have, in these convolutional layers, hopefully extracted relevant image features that this 6 x 6 x 32 tensor represents. To classify images, using these features, we will connect this tensor to a few fully connected layers, before we go to our final output layer. In this example, I'll use a 512-neuron fully connected layer, a 256-neuron fully connected layer, and finally, the 10-neuron output layer. I'll also be using dropout to help prevent overfitting, but only a very little bit! The code for this process is given as follows for your reference: from keras.layers import Flatten, Dense, Dropout # fully connected layers flatten = Flatten()(pool2) fc1 = Dense(512, activation="relu", name="fc1")(flatten) d1 = Dropout(rate=0.2, name="dropout1")(fc1) fc2 = Dense(256, activation="relu", name="fc2")(d1) d2 = Dropout(rate=0.2, name="dropout2")(fc2) I haven't previously mentioned the flatten layer above. The flatten layer does exactly what its name suggests. It flattens the n x 6 x 6 x 32 tensor into an n x 1152 vector. This will serve as an input to the fully connected layers. Working with multi-GPU models in Keras Many cloud computing platforms can provision instances that include multiple GPUs. As our models grow in size and complexity you might want to be able to parallelize the workload across multiple GPUs. This can be a somewhat involved process in native TensorFlow, but in Keras, it's just a function call. Build your model, as normal, as shown in the following code: model = Model(inputs=inputs, outputs=output) Then, we just pass that model to keras.utils.multi_gpu_model, with the help of the following code: model = multi_gpu_model(model, num_gpu) In this example, num_gpu is the number of GPUs we want to use. 
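The fit() calls in this article expect a data dictionary with train_X, train_y, val_X, and val_y entries, which the excerpt never shows being built. As a hedged sketch (the dictionary layout is an assumption based on how it is used here, not the book's exact loader), you could assemble it from keras.datasets like this:

from keras.datasets import cifar10
from keras.utils import to_categorical

IMG_HEIGHT, IMG_WIDTH, CHANNELS = 32, 32, 3

def build_data():
    (train_X, train_y), (test_X, test_y) = cifar10.load_data()
    # Scale pixel values to [0, 1] and one-hot encode the 10 class labels
    train_X = train_X.astype("float32") / 255.0
    test_X = test_X.astype("float32") / 255.0
    train_y = to_categorical(train_y, 10)
    test_y = to_categorical(test_y, 10)
    # Carve the 10,000 Keras test images into 5,000 validation and 5,000 test samples
    val_X, val_y = test_X[:5000], test_y[:5000]
    test_X, test_y = test_X[5000:], test_y[5000:]
    return {"train_X": train_X, "train_y": train_y,
            "val_X": val_X, "val_y": val_y,
            "test_X": test_X, "test_y": test_y}

data = build_data()

This matches the 50,000/5,000/5,000 split described at the start of the article and leaves the fit() and callback calls below unchanged.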
Training the model

Putting the model together, and incorporating our new cool multi-GPU feature, we come up with the following architecture:

def build_network(num_gpu=1, input_shape=None):
    inputs = Input(shape=input_shape, name="input")

    # convolutional block 1
    conv1 = Conv2D(64, kernel_size=(3,3), activation="relu", name="conv_1")(inputs)
    batch1 = BatchNormalization(name="batch_norm_1")(conv1)
    pool1 = MaxPooling2D(pool_size=(2, 2), name="pool_1")(batch1)

    # convolutional block 2
    conv2 = Conv2D(32, kernel_size=(3,3), activation="relu", name="conv_2")(pool1)
    batch2 = BatchNormalization(name="batch_norm_2")(conv2)
    pool2 = MaxPooling2D(pool_size=(2, 2), name="pool_2")(batch2)

    # fully connected layers
    flatten = Flatten()(pool2)
    fc1 = Dense(512, activation="relu", name="fc1")(flatten)
    d1 = Dropout(rate=0.2, name="dropout1")(fc1)
    fc2 = Dense(256, activation="relu", name="fc2")(d1)
    d2 = Dropout(rate=0.2, name="dropout2")(fc2)

    # output layer
    output = Dense(10, activation="softmax", name="softmax")(d2)

    # finalize and compile
    model = Model(inputs=inputs, outputs=output)
    if num_gpu > 1:
        model = multi_gpu_model(model, num_gpu)
    model.compile(optimizer='adam', loss='categorical_crossentropy',
                  metrics=["accuracy"])
    return model

We can use this to build our model:

model = build_network(num_gpu=1, input_shape=(IMG_HEIGHT, IMG_WIDTH, CHANNELS))

And then we can fit it, as you'd expect:

model.fit(x=data["train_X"], y=data["train_y"],
          batch_size=32,
          epochs=200,
          validation_data=(data["val_X"], data["val_y"]),
          verbose=1,
          callbacks=callbacks)

As we train this model, you will notice that overfitting is an immediate concern. Even with a relatively modest two convolutional layers, we're already overfitting a bit. You can see the effects of overfitting from the following graphs:

It's no surprise: 50,000 observations is not a lot of data, especially for a computer vision problem. In practice, computer vision problems benefit from very large datasets. In fact, Chen Sun et al. showed that the performance of computer vision models tends to improve linearly with the logarithm of the training data volume (https://arxiv.org/abs/1707.02968). Unfortunately, we can't really go and find more data in this case. But maybe we can make some. Let's talk about data augmentation next.

Using data augmentation

Data augmentation is a technique where we apply transformations to an image and use both the original image and the transformed images to train on. Imagine we had a training set with a cat in it:

If we were to apply a horizontal flip to this image, we'd get something that looks like this:

This is exactly the same image, of course, but we can use both the original and the transformation as training examples. This isn't quite as good as two separate cats in our training set; however, it does allow us to teach the computer that a cat is a cat regardless of the direction it's facing. In practice, we can do a lot more than just a horizontal flip. We can vertically flip, when it makes sense, shift, and randomly rotate images as well. This allows us to artificially amplify our dataset and make it seem bigger than it is. Of course, you can only push this so far, but it's a very powerful tool in the fight against overfitting when little data exists.

What is the Keras ImageDataGenerator?

Not so long ago, the only way to do image augmentation was to code up the transforms and apply them randomly to the training set, saving the transformed images to disk as we went (uphill, both ways, in the snow).
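For instance, a hand-rolled horizontal flip and dataset doubling is only a few lines of NumPy. This is purely an illustrative sketch of that older workflow (the function name is mine), not code from the book:

import numpy as np

def augment_with_flips(images, labels):
    # Hand-coded augmentation: append a horizontally flipped copy of every image
    # images is expected to have shape (n, height, width, channels)
    flipped = images[:, :, ::-1, :]
    return (np.concatenate([images, flipped], axis=0),
            np.concatenate([labels, labels], axis=0))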
Luckily for us, Keras now provides an ImageDataGenerator class that can apply transformations on the fly as we train, without us having to hand-code the transformations. We can create a data generator object from ImageDataGenerator by instantiating it like this:

def create_datagen(train_X):
    data_generator = ImageDataGenerator(
        rotation_range=20,
        width_shift_range=0.02,
        height_shift_range=0.02,
        horizontal_flip=True)
    data_generator.fit(train_X)
    return data_generator

In this example, I'm using width and height shifts, rotation, and horizontal flips. I'm using only very small shifts. Through experimentation, I found that larger shifts were too much and my network wasn't actually able to learn anything. Your experience will vary as your problem does, but I would expect larger images to be more tolerant of shifting. In this case, we're using 32-pixel images, which are quite small.

Training with a generator

If you haven't used a generator before, it works like an iterator. Every time you call the ImageDataGenerator's .flow() method, it will produce a new training minibatch, with random transformations applied to the images it was fed. The Keras Model class comes with a .fit_generator() method that allows us to fit with a generator rather than a given dataset:

model.fit_generator(data_generator.flow(data["train_X"], data["train_y"], batch_size=32),
                    steps_per_epoch=len(data["train_X"]) // 32,
                    epochs=200,
                    validation_data=(data["val_X"], data["val_y"]),
                    verbose=1,
                    callbacks=callbacks)

Here, we've replaced the traditional x and y parameters with the generator. Most importantly, notice the steps_per_epoch parameter. You can sample with replacement any number of times from the training set, and you can apply random transformations each time. This means that we can use more minibatches each epoch than we have data. Here, I'm going to sample only as many batches as I have observations, but that isn't required. We can and should push this number higher if we can.

Before we wrap things up, let's look at how beneficial image augmentation is in this case:

As you can see, just a little bit of image augmentation really helped us out. Not only is our overall accuracy higher, but our network is also overfitting much more slowly. If you have a computer vision problem with just a little bit of data, image augmentation is something you'll want to do.

We saw the benefits and ease of training a convolutional neural network from scratch using Keras, and then improving that network using data augmentation. If you found the above article to be useful, make sure you check out the book Deep Learning Quick Reference for more information on modeling and training various different types of deep neural networks with ease and efficiency.

Top 5 Deep Learning Architectures
CapsNet: Are Capsule networks the antidote for CNNs kryptonite?
What is a CNN?
Working with Shared pointers in Rust: Challenges and Solutions [Tutorial]

Aaron Lazar
22 Aug 2018
8 min read
One of Rust's most criticized problems is that it's difficult to develop an application with shared pointers. It's true that, due to Rust's memory safety guarantees, it might be difficult to develop those kinds of algorithms, but as we will see now, the standard library gives us types we can use to safely allow that behavior. In this article, we'll understand how to overcome the issue of shared pointers in Rust to increase efficiency. This article is an extract from Rust High Performance, authored by Iban Eguia Moraza.

Overcoming the issue with the cell module

The standard Rust library has one interesting module, the std::cell module, that allows us to use objects with interior mutability. This means that we can have an immutable object and still mutate it by getting a mutable borrow to the underlying data. This, of course, would not comply with the mutability rules we saw before, but the cells make sure this works by checking the borrows at runtime or by doing copies of the underlying data.

Cells

Let's start with the basic Cell structure. A Cell will contain a mutable value, but it can be mutated without having a mutable Cell. It has mainly three interesting methods: set(), swap(), and replace(). The first allows us to set the contained value, replacing it with a new value. The previous value will be dropped (its destructor will run). That last bit is the only difference from the replace() method. In the replace() method, instead of dropping the previous value, it will be returned. The swap() method, on the other hand, will take another Cell and swap the values between the two. All this without the Cell needing to be mutable. Let's see it with an example:

use std::cell::Cell;

#[derive(Copy, Clone)]
struct House {
    bedrooms: u8,
}

impl Default for House {
    fn default() -> Self {
        House { bedrooms: 1 }
    }
}

fn main() {
    let my_house = House { bedrooms: 2 };
    let my_dream_house = House { bedrooms: 5 };

    let my_cell = Cell::new(my_house);
    println!("My house has {} bedrooms.", my_cell.get().bedrooms);

    my_cell.set(my_dream_house);
    println!("My new house has {} bedrooms.", my_cell.get().bedrooms);

    let my_new_old_house = my_cell.replace(my_house);
    println!(
        "My house has {} bedrooms, it was better with {}",
        my_cell.get().bedrooms,
        my_new_old_house.bedrooms
    );

    let my_new_cell = Cell::new(my_dream_house);
    my_cell.swap(&my_new_cell);
    println!(
        "Yay! my current house has {} bedrooms! (my new house {})",
        my_cell.get().bedrooms,
        my_new_cell.get().bedrooms
    );

    let my_final_house = my_cell.take();
    println!(
        "My final house has {} bedrooms, the shared one {}",
        my_final_house.bedrooms,
        my_cell.get().bedrooms
    );
}

As you can see in the example, to use a Cell, the contained type must be Copy. If the contained type is not Copy, you will need to use a RefCell, which we will see next. Continuing with this Cell example, as you can see through the code, the output will be the following:

So we first create two houses, we select one of them as the current one, and we keep mutating the current and the new ones. As you might have seen, I also used the take() method, only available for types implementing the Default trait. This method will return the current value, replacing it with the default value. As you can see, you don't really mutate the value inside, but you replace it with another value. You can either retrieve the old value or lose it. Also, when using the get() method, you get a copy of the current value, and not a reference to it. That's why you can only use elements implementing Copy with a Cell. This also means that a Cell does not need to dynamically check borrows at runtime.
RefCell

RefCell is similar to Cell, except that it accepts non-Copy data. This also means that when modifying the underlying object, it cannot simply return a copy of it; it will need to return references. In the same way, when you want to mutate the object inside, it will return a mutable reference. This only works because it dynamically checks at runtime whether another borrow exists before returning a mutable borrow (or the other way around), and if one does, the thread will panic.

Instead of using the get() method as in Cell, RefCell has two methods to get the underlying data: borrow() and borrow_mut(). The first will get a read-only borrow, and you can have as many immutable borrows in a scope as you like. The second one will return a read-write borrow, and you will only be able to have one in scope, to follow the mutability rules. If you try to do a borrow_mut() after a borrow() in the same scope, or a borrow() after a borrow_mut(), the thread will panic.

There are two non-panicking alternatives to these borrows: try_borrow() and try_borrow_mut(). These two will try to borrow the data (the first read-only and the second read/write), and if there are incompatible borrows present, they will return a Result::Err, so that you can handle the error without panicking.

Both Cell and RefCell have a get_mut() method that will get a mutable reference to the element inside, but it requires the Cell / RefCell to be mutable, so it doesn't make much sense if you need the Cell / RefCell to be immutable. Nevertheless, if in a part of the code you can actually have a mutable Cell / RefCell, you should use this method to change the contents, since it will check all the rules statically at compile time, without runtime overhead.

Interestingly enough, RefCell does not return a plain reference to the underlying data when we call borrow() or borrow_mut(). You would expect them to return &T and &mut T (where T is the wrapped element). Instead, they will return a Ref and a RefMut, respectively. This is to safely wrap the reference inside, so that the lifetimes get correctly calculated by the compiler without requiring references to live for the whole lifetime of the RefCell. They implement Deref into references, though, so thanks to Rust's Deref coercion, you can use them as references.
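To make the borrow rules concrete, here is a small, self-contained sketch (my own example, not from the book) that reuses a House type without Copy and shows both a coexisting pair of read-only borrows and the non-panicking try_borrow_mut() path:

use std::cell::RefCell;

struct House {
    bedrooms: u8,
}

fn main() {
    let my_cell = RefCell::new(House { bedrooms: 2 });

    {
        // Any number of read-only borrows can coexist in the same scope
        let h1 = my_cell.borrow();
        let h2 = my_cell.borrow();
        println!("{} and {} bedrooms", h1.bedrooms, h2.bedrooms);

        // A borrow_mut() here would panic; try_borrow_mut() reports the conflict instead
        assert!(my_cell.try_borrow_mut().is_err());
    } // both read-only borrows end here

    // Now a mutable borrow is fine, even though my_cell itself is not declared mut
    my_cell.borrow_mut().bedrooms = 5;
    println!("Now the house has {} bedrooms", my_cell.borrow().bedrooms);
}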
Overcoming the issue with the rc module

The std::rc module contains reference-counted pointers that can be used in single-threaded applications. They have very little overhead, thanks to their counters not being atomic counters, but this means that using them in multithreaded applications could cause data races. Thus, Rust will stop you from sending them between threads at compile time.

There are two structures in this module: Rc and Weak. An Rc is an owning pointer to the heap. This means that it's the same as a Box, except that it allows for reference-counted pointers. When the Rc goes out of scope, it will decrease the number of references by 1, and if that count is 0, it will drop the contained object. Since an Rc is a shared reference, it cannot be mutated, but a common pattern is to use a Cell or a RefCell inside the Rc to allow for interior mutability.

An Rc can be downgraded to a Weak pointer, which will have a borrowed reference to the heap. When an Rc drops the value inside, it will not check whether there are Weak pointers to it. This means that a Weak pointer will not always have a valid reference, and therefore, for safety reasons, the only way to check the value of the Weak pointer is to upgrade it to an Rc, which could fail. The upgrade() method will return None if the reference has been dropped.

Let's check all this by creating an example binary tree structure:

use std::cell::RefCell;
use std::rc::{Rc, Weak};

struct Tree<T> {
    root: Node<T>,
}

struct Node<T> {
    parent: Option<Weak<Node<T>>>,
    left: Option<Rc<RefCell<Node<T>>>>,
    right: Option<Rc<RefCell<Node<T>>>>,
    value: T,
}

In this case, the tree will have a root node, and each of the nodes can have up to two children. We call them left and right, because they are usually represented as trees with one child on each side. Each node has a pointer to each of its children, and it owns the children nodes. This means that when a node loses all references, it will be dropped, and with it, its children.

Each child has a pointer to its parent. The main issue with this is that, if the child held an Rc pointer to its parent, neither would ever be dropped. This is a circular dependency, and to avoid it, the pointer to the parent will be a Weak pointer.
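Here is a minimal sketch of that parent/child relationship in action. The node layout is simplified from the Tree above (no generics, and a Vec of children instead of left/right), so treat it as an illustration of Rc::downgrade() and upgrade() rather than the book's implementation:

use std::cell::RefCell;
use std::rc::{Rc, Weak};

struct SimpleNode {
    value: i32,
    parent: RefCell<Weak<SimpleNode>>,
    children: RefCell<Vec<Rc<SimpleNode>>>,
}

fn main() {
    let parent = Rc::new(SimpleNode {
        value: 1,
        parent: RefCell::new(Weak::new()),
        children: RefCell::new(Vec::new()),
    });

    let child = Rc::new(SimpleNode {
        value: 2,
        // Weak pointer back to the parent: no ownership, so no reference cycle
        parent: RefCell::new(Rc::downgrade(&parent)),
        children: RefCell::new(Vec::new()),
    });

    // The parent owns the child through a strong Rc
    parent.children.borrow_mut().push(Rc::clone(&child));

    // upgrade() only succeeds while the parent is still alive
    match child.parent.borrow().upgrade() {
        Some(p) => println!("child {} has parent {}", child.value, p.value),
        None => println!("the parent has already been dropped"),
    }
}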
So, you've finally understood how Rust manages shared pointers for complex structures, where the Rust borrow checker can make your coding experience much more difficult. If you found this article useful and would like to learn more such tips, head over to pick up the book, Rust High Performance, authored by Iban Eguia Moraza.

Perform Advanced Programming with Rust
Rust 1.28 is here with global allocators, nonZero types and more
Say hello to Sequoia: a new Rust based OpenPGP library to secure your apps
Generative Adversarial Networks: Generate images using Keras GAN [Tutorial]

Amey Varangaonkar
21 Aug 2018
12 min read
You might have worked with the popular MNIST dataset before - but in this article, we will be generating new MNIST-like images with a Keras GAN. It can take a very long time to train a GAN; however, this problem is small enough to run on most laptops in a few hours, which makes it a great example. The following excerpt is taken from the book Deep Learning Quick Reference, authored by Mike Bernico.

The network architecture that we will be using here has been found, and optimized, by many folks, including the authors of the DCGAN paper and people like Erik Linder-Norén, whose excellent collection of GAN implementations, called Keras GAN, served as the basis of the code we use here.

Loading the MNIST dataset

The MNIST dataset consists of 70,000 hand-drawn digits, 0 to 9. Keras provides us with a built-in loader that splits it into 60,000 training images and 10,000 test images. We will use the following code to load the dataset:

from keras.datasets import mnist

def load_data():
    (X_train, _), (_, _) = mnist.load_data()
    X_train = (X_train.astype(np.float32) - 127.5) / 127.5
    X_train = np.expand_dims(X_train, axis=3)
    return X_train

As you probably noticed, we're not returning any of the labels or the testing dataset. We're only going to use the training dataset. The labels aren't needed because the only labels we will be using are 0 for fake and 1 for real. These are real images, so they will all be assigned a label of 1 at the discriminator.

Building the generator

The generator uses a few new layers that we will talk about in this section. First, take a moment to skim through the following code:

def build_generator(noise_shape=(100,)):
    input = Input(noise_shape)
    x = Dense(128 * 7 * 7, activation="relu")(input)
    x = Reshape((7, 7, 128))(x)
    x = BatchNormalization(momentum=0.8)(x)
    x = UpSampling2D()(x)
    x = Conv2D(128, kernel_size=3, padding="same")(x)
    x = Activation("relu")(x)
    x = BatchNormalization(momentum=0.8)(x)
    x = UpSampling2D()(x)
    x = Conv2D(64, kernel_size=3, padding="same")(x)
    x = Activation("relu")(x)
    x = BatchNormalization(momentum=0.8)(x)
    x = Conv2D(1, kernel_size=3, padding="same")(x)
    out = Activation("tanh")(x)
    model = Model(input, out)
    print("-- Generator -- ")
    model.summary()
    return model

We have not previously used the UpSampling2D layer. This layer increases the rows and columns of the input tensor, leaving the channels unchanged. It does this by repeating the values in the input tensor. By default, it will double the input. If we give an UpSampling2D layer a 7 x 7 x 128 input, it will give us a 14 x 14 x 128 output.

Typically, when we build a CNN, we start with an image that is very tall and wide and use convolutional layers to get a tensor that's very deep but less tall and wide. Here we will do the opposite. We'll use a dense layer and a reshape to start with a 7 x 7 x 128 tensor and then, after doubling it twice, we'll be left with a 28 x 28 tensor. Since we need a grayscale image, we can use a convolutional layer with a single unit to get a 28 x 28 x 1 output. This sort of generator arithmetic is a little off-putting and can seem awkward at first, but after a few painful hours, you will get the hang of it!
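The snippets in this excerpt assume a handful of Keras imports that aren't repeated here. A plausible set for the generator, plus a quick shape check to confirm the arithmetic described above (the check itself is my own sanity test, not the book's code), might look like this:

import numpy as np
from keras.layers import (Input, Dense, Reshape, BatchNormalization,
                          UpSampling2D, Conv2D, Activation)
from keras.models import Model

# Throwaway instance, just to confirm the noise-to-image arithmetic
gen_check = build_generator()
noise = np.random.normal(0, 1, (1, 100))
print(gen_check.predict(noise).shape)   # expected: (1, 28, 28, 1)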
Building the discriminator

The discriminator is really, for the most part, the same as any other CNN. Of course, there are a few new things that we should talk about. We will use the following code to build the discriminator:

def build_discriminator(img_shape):
    input = Input(img_shape)
    x = Conv2D(32, kernel_size=3, strides=2, padding="same")(input)
    x = LeakyReLU(alpha=0.2)(x)
    x = Dropout(0.25)(x)
    x = Conv2D(64, kernel_size=3, strides=2, padding="same")(x)
    x = ZeroPadding2D(padding=((0, 1), (0, 1)))(x)
    x = LeakyReLU(alpha=0.2)(x)
    x = Dropout(0.25)(x)
    x = BatchNormalization(momentum=0.8)(x)
    x = Conv2D(128, kernel_size=3, strides=2, padding="same")(x)
    x = LeakyReLU(alpha=0.2)(x)
    x = Dropout(0.25)(x)
    x = BatchNormalization(momentum=0.8)(x)
    x = Conv2D(256, kernel_size=3, strides=1, padding="same")(x)
    x = LeakyReLU(alpha=0.2)(x)
    x = Dropout(0.25)(x)
    x = Flatten()(x)
    out = Dense(1, activation='sigmoid')(x)
    model = Model(input, out)
    print("-- Discriminator -- ")
    model.summary()
    return model

First, you might notice the oddly shaped ZeroPadding2D() layer. After the second convolution, our tensor has gone from 28 x 28 x 1 to 7 x 7 x 64. This layer just gets us back to an even number, adding zeros on one side of both the rows and columns so that our tensor is now 8 x 8 x 64.

More unusual is the use of both batch normalization and dropout. Typically, these two layers are not used together; however, in the case of GANs, they do seem to benefit the network.

Building the stacked model

Now that we've assembled both the generator and the discriminator, we need to assemble a third model that is the stack of both models together, which we can use for training the generator given the discriminator loss. To do that, we can just create a new model, this time using the previous models as layers in the new model, as shown in the following code:

discriminator = build_discriminator(img_shape=(28, 28, 1))
generator = build_generator()

z = Input(shape=(100,))
img = generator(z)

discriminator.trainable = False
real = discriminator(img)
combined = Model(z, real)

Notice that we're setting the discriminator's trainable attribute to False before building the model. This means that for this model we will not be updating the weights of the discriminator during backpropagation. We will freeze these weights and only move the generator weights with the stack. The discriminator will be trained separately.

Now that all the models are built, they need to be compiled, as shown in the following code:

gen_optimizer = Adam(lr=0.0002, beta_1=0.5)
disc_optimizer = Adam(lr=0.0002, beta_1=0.5)

discriminator.compile(loss='binary_crossentropy',
                      optimizer=disc_optimizer,
                      metrics=['accuracy'])

generator.compile(loss='binary_crossentropy',
                  optimizer=gen_optimizer)

combined.compile(loss='binary_crossentropy',
                 optimizer=gen_optimizer)

If you'll notice, we're creating two custom Adam optimizers. This is because many times we will want to change the learning rate for only the discriminator or the generator, slowing one or the other down so that we end up with a stable GAN where neither is overpowering the other. You'll also notice that we're using beta_1 = 0.5. This is a recommendation from the original DCGAN paper that we've carried forward and also had success with. A learning rate of 0.0002 is a good place to start as well, and was found in the original DCGAN paper.
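The training loop in the next section refers to X_train, batch_size, and epochs, which are set up elsewhere in the full script. A plausible, minimal setup is sketched below; the epoch count is my assumption, while the Adam import is simply the keras.optimizers class used in the compile step above:

from keras.optimizers import Adam

X_train = load_data()    # the MNIST loader defined earlier
batch_size = 32
epochs = 10              # even a few epochs produce recognizable digits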
The training loop

We have previously had the luxury of calling .fit() on our model and letting Keras handle the painful process of breaking the data apart into minibatches and training for us. Unfortunately, because we need to perform separate updates for the discriminator and the stacked model within a single batch, we're going to have to do things the old-fashioned way, with a few loops. This is how things used to be done all the time, so while it's perhaps a little more work, it does admittedly leave me feeling nostalgic. The following code illustrates the training technique:

num_examples = X_train.shape[0]
num_batches = int(num_examples / float(batch_size))
half_batch = int(batch_size / 2)

for epoch in range(epochs + 1):
    for batch in range(num_batches):
        # noise images for the batch
        noise = np.random.normal(0, 1, (half_batch, 100))
        fake_images = generator.predict(noise)
        fake_labels = np.zeros((half_batch, 1))

        # real images for batch
        idx = np.random.randint(0, X_train.shape[0], half_batch)
        real_images = X_train[idx]
        real_labels = np.ones((half_batch, 1))

        # Train the discriminator (real classified as ones and generated as zeros)
        d_loss_real = discriminator.train_on_batch(real_images, real_labels)
        d_loss_fake = discriminator.train_on_batch(fake_images, fake_labels)
        d_loss = 0.5 * np.add(d_loss_real, d_loss_fake)

        noise = np.random.normal(0, 1, (batch_size, 100))

        # Train the generator
        g_loss = combined.train_on_batch(noise, np.ones((batch_size, 1)))

        # Plot the progress
        print("Epoch %d Batch %d/%d [D loss: %f, acc.: %.2f%%] [G loss: %f]" %
              (epoch, batch, num_batches, d_loss[0], 100 * d_loss[1], g_loss))

        if batch % 50 == 0:
            save_imgs(generator, epoch, batch)

There is a lot going on here, to be sure. As before, let's break it down block by block. First, let's see the code to generate noise vectors:

noise = np.random.normal(0, 1, (half_batch, 100))
fake_images = generator.predict(noise)
fake_labels = np.zeros((half_batch, 1))

This code is generating a matrix of noise vectors (called z in the GAN literature) and sending it to the generator. It's getting a set of generated images back, which we're calling fake images. We will use these to train the discriminator, so the labels we want to use are 0s, indicating that these are in fact generated images.

Note that the shape here is half_batch x 28 x 28 x 1. The half_batch is exactly what you think it is. We're creating half a batch of generated images because the other half of the batch will be real data, which we will assemble next. To get our real images, we will generate a random set of indices across X_train and use that slice of X_train as our real images, as shown in the following code:

idx = np.random.randint(0, X_train.shape[0], half_batch)
real_images = X_train[idx]
real_labels = np.ones((half_batch, 1))

Yes, we are sampling with replacement in this case. It does work out, but it's probably not the best way to implement minibatch training. It is, however, probably the easiest and most common.

Since we are using these images to train the discriminator, and because they are real images, we will assign them 1s as labels, rather than 0s. Now that we have our discriminator training set assembled, we will update the discriminator. Also, note that we aren't using soft labels. That's because we want to keep things as easy as they can be to understand. Luckily, the network doesn't require them in this case.
We will use the following code to train the discriminator:

# Train the discriminator (real classified as ones and generated as zeros)
d_loss_real = discriminator.train_on_batch(real_images, real_labels)
d_loss_fake = discriminator.train_on_batch(fake_images, fake_labels)
d_loss = 0.5 * np.add(d_loss_real, d_loss_fake)

Notice that here we're using the discriminator's train_on_batch() method. The train_on_batch() method does exactly one round of forward and backward propagation. Every time we call it, it updates the model once from the model's previous state.

Also, notice that we're making the update for the real images and the fake images separately. This is advice that is given on the GAN hack Git we had previously referenced in the Generator architecture section. Especially in the early stages of training, when real images and fake images come from radically different distributions, batch normalization will cause problems with training if we were to put both sets of data in the same update.

Now that the discriminator has been updated, it's time to update the generator. This is done indirectly by updating the combined stack, as shown in the following code:

noise = np.random.normal(0, 1, (batch_size, 100))
g_loss = combined.train_on_batch(noise, np.ones((batch_size, 1)))

To update the combined model, we create a new noise matrix, and this time it will be as large as the entire batch. We will use that as an input to the stack, which will cause the generator to generate an image and the discriminator to evaluate that image. Finally, we will use the label of 1 because we want to backpropagate the error between a real image and the generated image.

Lastly, the training loop reports the discriminator and generator loss at the epoch/batch, and then, every 50 batches of every epoch, we will use save_imgs to generate example images and save them to disk, as shown in the following code:

print("Epoch %d Batch %d/%d [D loss: %f, acc.: %.2f%%] [G loss: %f]" %
      (epoch, batch, num_batches, d_loss[0], 100 * d_loss[1], g_loss))

if batch % 50 == 0:
    save_imgs(generator, epoch, batch)

The save_imgs function uses the generator to create images as we go, so we can see the fruits of our labor. We will use the following code to define save_imgs:

def save_imgs(generator, epoch, batch):
    r, c = 5, 5
    noise = np.random.normal(0, 1, (r * c, 100))
    gen_imgs = generator.predict(noise)
    gen_imgs = 0.5 * gen_imgs + 0.5
    fig, axs = plt.subplots(r, c)
    cnt = 0
    for i in range(r):
        for j in range(c):
            axs[i, j].imshow(gen_imgs[cnt, :, :, 0], cmap='gray')
            axs[i, j].axis('off')
            cnt += 1
    fig.savefig("images/mnist_%d_%d.png" % (epoch, batch))
    plt.close()

It uses only the generator, creating a noise matrix and retrieving an image matrix in return. Then, using matplotlib.pyplot, it saves those images to disk in a 5 x 5 grid.

Performing model evaluation

Good is somewhat subjective when you're building a deep neural network to create images. Let's take a look at a few examples of the training process, so you can see for yourself how the GAN begins to learn to generate MNIST.

Here's the network at the very first batch of the very first epoch. Clearly, the generator doesn't really know anything about generating MNIST at this point; it's just noise, as shown in the following image:

But just 50 batches in, something is happening, as you can see from the following image:

And after 200 batches of epoch 0, we can almost see numbers, as you can see from the following image:

And here's our generator after one full epoch.
These generated numbers look pretty good, and we can see how the discriminator might be fooled by them. At this point, we could probably continue to improve a little bit, but it looks like our GAN has worked, as the computer is generating some pretty convincing MNIST digits, as shown in the following image:

Thus, we see the power of GANs in action when it comes to image generation using the Keras library. If you found the above article to be useful, make sure you check out our book Deep Learning Quick Reference, for more such interesting coverage of popular deep learning concepts and their practical implementation.

Keras 2.2.0 releases!
2 ways to customize your deep learning models with Keras
How to build Deep convolutional GAN using TensorFlow and Keras