How-To Tutorials

#AskTensorFlow: Twitterati ask questions on TensorFlow 2.0 - TF prebuilt binaries, Tensorboard, Keras, and Python support

Sugandha Lahoti
10 Dec 2019
5 min read
TensorFlow 2.0 was released recently with tighter integration with Keras, eager execution enabled by default, three times faster training performance, a cleaned-up API, and more.

TensorFlow 2.0 includes a major API cleanup: many API symbols have been removed or renamed for better consistency and clarity. Eager execution is now enabled by default, which effectively means that your TensorFlow code runs like NumPy code. Keras has been introduced as the main high-level API, enabling developers to easily leverage Keras' various model-building APIs. TensorFlow 2.0 also has the SavedModel API, which allows you to save your trained machine learning model in a language-neutral format.

In May, Paige Bailey, Product Manager (TensorFlow), and Laurence Moroney, Developer Advocate at Google, sat down to discuss frequently asked questions on TensorFlow 2.0. They talked about TensorFlow prebuilt binaries, the TF 2.0 upgrade script, TensorFlow Datasets, and Python support.

Can I ask about any prebuilt binary for the RTX 2080 GPU on Ubuntu 16?

Prebuilt binaries for TensorFlow tend to be associated with a specific driver version from Nvidia. If you're looking at any of the prebuilt binaries, check which driver version is supported for that specific card. It's easy to go to the driver vendor and download the latest version, but that may not be the one that TensorFlow was built for or supports. So, just make sure that they actually match each other.

Do my TensorFlow scripts work with TensorFlow 2.0?

Generally, existing TensorFlow scripts do not work unchanged with TensorFlow 2.0, but TensorFlow 2.0 ships with an upgrade utility that is automatically downloaded with it. For more information, you can check out the Medium blog post that Paige and her colleague Anna created. It shows how you can run the upgrade script on any arbitrary Python file, or even Jupyter Notebooks. It will give you an export.txt file that lists all of the symbol renames, the added keywords, and the remaining manual changes.

When will TensorFlow be supported in Python 3.7 and hence be accessible in Anaconda 3?

TensorFlow has made the commitment that, as of January 1, 2020, it will no longer support Python 2. The team is firmly committed to Python 3 and Python 3 support.

Is it possible to run TensorBoard on Colab?

You can run TensorBoard on Colab and perform operations such as smoothing, changing some of the values, and using the embedding visualizer directly from your Colab notebook, in order to understand accuracies and debug model performance. You also don't have to specify ports, which means you don't need to keep track of multiple TensorBoard instances; TensorBoard automatically selects a port that is a good candidate.

How would you use [TensorFlow's] feature_columns with Keras?

TensorFlow's feature_columns API is quite useful for non-numerical feature processing. Feature columns are a way of getting your data efficiently into Estimators, and you can use them in Keras as well. TensorFlow 2.0 also has a migration guide if you want to migrate your models from Estimators to a more TensorFlow 2.0-style format with Keras.
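To make that answer a little more concrete, here is a minimal sketch of the pattern described above: feature column definitions fed into a Keras model through tf.keras.layers.DenseFeatures. The tiny in-memory dataset and column names are made up purely for illustration.

import tensorflow as tf

# Hypothetical toy data: one numeric and one categorical feature.
data = {
    "age": [25.0, 32.0, 47.0, 51.0],
    "city": ["NYC", "SF", "NYC", "LA"],
    "label": [0, 1, 1, 0],
}

# Define a feature column for each input feature.
feature_columns = [
    tf.feature_column.numeric_column("age"),
    tf.feature_column.indicator_column(
        tf.feature_column.categorical_column_with_vocabulary_list(
            "city", ["NYC", "SF", "LA"])),
]

# DenseFeatures turns the feature columns into a Keras input layer.
model = tf.keras.Sequential([
    tf.keras.layers.DenseFeatures(feature_columns),
    tf.keras.layers.Dense(16, activation="relu"),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy")

# tf.data feeds dicts of features plus labels into the model.
labels = data.pop("label")
dataset = tf.data.Dataset.from_tensor_slices((dict(data), labels)).batch(2)
model.fit(dataset, epochs=1)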
What are some simple datasets for testing and comparing different training methods for artificial neural networks? Are there any in TensorFlow 2.0?

Although MNIST and Fashion-MNIST are great, TensorFlow 2.0 also has TensorFlow Datasets, which provides a collection of datasets ready to use with TensorFlow. It handles downloading and preparing the data and constructing a tf.data pipeline. TensorFlow Datasets is compatible with both TensorFlow eager mode and graph mode, and you can use the datasets with all of your deep learning and machine learning models with just a few lines of code.

What about all the web developers who are new to AI; how does TensorFlow 2.0 help them get started?

With TensorFlow 2.0, the models that you create using SavedModel can be deployed to TFLite or TensorFlow.js. The Keras layers are also supported in TensorFlow.js, so it's not just for Python developers but also for JavaScript developers, or even R developers.

You can watch Paige and Laurence answering more questions in a three-part video series available on YouTube. Some of the other questions asked were:

Is there any TensorFlow.js transfer learning example for object detection?
Are you going to publish an updated version of the TensorFlow for Poets tutorial from Pete Warden, implementing TF 2.0, TFLite 2.0, and NN-API for faster inference on Android devices equipped with NPU/DSP?
Will a frozen graph generated from TF 1.x work on TF 2.0?
Which is the preferred format for saving the model going forward: saved_model (SM) or HDF5?
What is the purpose of keeping Estimators and Keras as separate APIs?

If you want to quickly start building machine learning projects with TensorFlow 2.0, read our book TensorFlow 2.0 Quick Start Guide by Tony Holdroyd. In this book, you will get acquainted with the new practices introduced in TensorFlow 2.0 and learn to train your own models for effective prediction using the high-level Keras API.

TensorFlow.js contributor Kai Sasaki on how TensorFlow.js eases web-based machine learning application development
Introducing Spleeter, a TensorFlow-based Python library that extracts voice and sound from any music track
TensorFlow 2.0 released with tighter Keras integration, eager execution enabled by default, and more!
Brad Miro talks TensorFlow 2.0 features and how Google is using it internally

Build a custom news feed with Python [Tutorial]

Prasad Ramesh
10 Sep 2018
13 min read
To create a custom news feed model, we need data that we can train on. This training data will be fed into a model in order to teach it to discriminate between the articles that we'd be interested in and the ones that we would not. This article is an excerpt from a book written by Alexander T. Combs titled Python Machine Learning Blueprints: Intuitive data projects you can relate to. In this article, we will learn to build a custom news corpus and annotate a large number of articles corresponding to our interests. You can download the code and other relevant files used in this article from this GitHub link.

Creating a supervised training dataset

Before we can create a model of our taste in news articles, we need training data. This training data will be fed into our model in order to teach it to discriminate between the articles that we'd be interested in and the ones that we would not. To build this corpus, we will need to annotate a large number of articles that correspond to these interests. For each article, we'll label it either "y" or "n". This will indicate whether the article is one that we would want to have sent to us in our daily digest or not.

To simplify this process, we will use the Pocket app. Pocket is an application that allows you to save stories to read later. You simply install the browser extension, and then click on the Pocket icon in your browser's toolbar when you wish to save a story. The article is saved to your personal repository. One of the great features of Pocket for our purposes is its ability to save the article with a tag of your choosing. We'll use this feature to mark interesting articles as "y" and non-interesting articles as "n".

Installing the Pocket Chrome extension

We use Google Chrome here, but other browsers should work similarly. For Chrome, go into the Chrome Web Store and look for the Extensions section:

Image from https://chrome.google.com/webstore/search/pocket

Click on the blue Add to Chrome button. If you already have an account, log in, and if you do not have an account, go ahead and sign up (it's free). Once this is complete, you should see the Pocket icon in the upper right-hand corner of your browser. It will be greyed out, but once there is an article you wish to save, you can click on it. It will turn red once the article has been saved, as seen in the following images. The greyed out icon can be seen in the upper right-hand corner.

Image from https://news.ycombinator.com

When the icon is clicked, it turns red to indicate that the article has been saved.

Image from https://www.wsj.com

Now comes the fun part! Begin saving all articles that you come across. Tag the interesting ones with "y" and the non-interesting ones with "n". This is going to take some work. Your end results will only be as good as your training set, so you're going to need to do this for hundreds of articles. If you forget to tag an article when you save it, you can always go to the site, http://www.get.pocket.com, to tag it there.

Using the Pocket API to retrieve stories

Now that you've diligently saved your articles to Pocket, the next step is to retrieve them. To accomplish this, we'll use the Pocket API. You can sign up for an account at https://getpocket.com/developer/apps/new. Click on Create New App in the upper left-hand side and fill in the details to get your API key. Make sure to click all of the permissions so that you can add, change, and retrieve articles.

Image from https://getpocket.com/developer

Once you have filled this in and submitted it, you will receive your CONSUMER KEY. You can find this in the upper left-hand corner under My Apps. This will look like the following screen, but obviously with a real key:

Image from https://getpocket.com/developer

Once this is set, you are ready to move on to the next step, which is to set up the authorizations. It requires that you input your consumer key and a redirect URL. The redirect URL can be anything. Here I have used my Twitter account:

import requests

auth_params = {'consumer_key': 'MY_CONSUMER_KEY',
               'redirect_uri': 'https://www.twitter.com/acombs'}
tkn = requests.post('https://getpocket.com/v3/oauth/request', data=auth_params)
tkn.content

You will see the following output:

The output will have the code that you'll need for the next step. Place the following in your browser bar:

https://getpocket.com/auth/authorize?request_token=some_long_code&redirect_uri=https%3A//www.twitter.com/acombs

If you change the redirect URL to one of your own, make sure to URL encode it. There are a number of resources for this. One option is to use the Python library urllib; another is to use a free online source.
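As a quick aside on that URL-encoding step, the standard library's urllib can do it in one line; the Twitter URL here is just the example redirect used above:

from urllib.parse import quote

# Percent-encode the redirect URL before appending it as redirect_uri.
print(quote('https://www.twitter.com/acombs'))
# https%3A//www.twitter.com/acombs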
At this point, you should be presented with an authorization screen. Go ahead and approve it, and we can move on to the next step:

usr_params = {'consumer_key': 'my_consumer_key', 'code': 'some_long_code'}
usr = requests.post('https://getpocket.com/v3/oauth/authorize', data=usr_params)
usr.content

We'll use the output code here to move on to retrieving the stories. First, we retrieve the stories tagged "n":

no_params = {'consumer_key': 'my_consumer_key',
             'access_token': 'some_super_long_code',
             'tag': 'n'}
no_result = requests.post('https://getpocket.com/v3/get', data=no_params)
no_result.text

The preceding code generates the following output:

Note that we get back a long JSON string covering all the articles that we tagged "n". There are several keys in it, but we are really only interested in the URL at this point. We'll go ahead and create a list of all the URLs from this:

import json

no_jf = json.loads(no_result.text)
no_jd = no_jf['list']
no_urls = []
for i in no_jd.values():
    no_urls.append(i.get('resolved_url'))
no_urls

The preceding code generates the following output:

This list contains all the URLs of stories that we aren't interested in. Now, let's put this in a DataFrame object and tag it as such:

import pandas as pd

no_uf = pd.DataFrame(no_urls, columns=['urls'])
no_uf = no_uf.assign(wanted=lambda x: 'n')
no_uf

The preceding code generates the following output:

Now, we're all set with the unwanted stories. Let's do the same thing with the stories that we are interested in:

yes_params = {'consumer_key': 'my_consumer_key',
              'access_token': 'some_super_long_token',
              'tag': 'y'}
yes_result = requests.post('https://getpocket.com/v3/get', data=yes_params)
yes_jf = json.loads(yes_result.text)
yes_jd = yes_jf['list']
yes_urls = []
for i in yes_jd.values():
    yes_urls.append(i.get('resolved_url'))
yes_uf = pd.DataFrame(yes_urls, columns=['urls'])
yes_uf = yes_uf.assign(wanted=lambda x: 'y')
yes_uf

The preceding code generates the following output:

Now that we have both types of stories for our training data, let's join them together into a single DataFrame:

df = pd.concat([yes_uf, no_uf])
df.dropna(inplace=True)
df

The preceding code generates the following output:

Now that we're set with all our URLs and their corresponding tags in a single frame, we'll move on to downloading the HTML for each article.
We'll use another free service for this called embed.ly.

Using the embed.ly API to download story bodies

We have all the URLs for our stories, but unfortunately this isn't enough to train on. We'll need the full article body. By itself, this could become a huge challenge if we wanted to roll our own scraper, especially if we were going to be pulling stories from dozens of sites. We would need to write code to target the article body while carefully avoiding all the other site gunk that surrounds it. Fortunately, there are a number of free services that will do this for us. We're going to use embed.ly, but there are a number of other services that you could also use.

The first step is to sign up for embed.ly API access. You can do this at https://app.embed.ly/signup. This is a straightforward process. Once you confirm your registration, you will receive an API key. You just need to use this key in your HTTP request. Let's do this now:

import urllib.parse

def get_html(x):
    qurl = urllib.parse.quote(x)
    rhtml = requests.get('https://api.embedly.com/1/extract?url=' + qurl + '&key=some_api_key')
    ctnt = json.loads(rhtml.text).get('content')
    return ctnt

df.loc[:, 'html'] = df['urls'].map(get_html)
df.dropna(inplace=True)
df

The preceding code generates the following output:

With that, we have the HTML of each story. As the content is embedded in HTML markup and we want to feed plain text into our model, we'll use a parser to strip out the markup tags:

from bs4 import BeautifulSoup

def get_text(x):
    soup = BeautifulSoup(x, 'lxml')
    text = soup.get_text()
    return text

df.loc[:, 'text'] = df['html'].map(get_text)
df

The preceding code generates the following output:

With this, we have our training set ready. We can now move on to a discussion of how to transform our text into something that a model can work with.

Setting up your daily personal newsletter

In order to set up a personal e-mail with news stories, we're going to utilize IFTTT again. As in Chapter 3, Build an App to Find Cheap Airfares, we'll use the Maker Channel to send a POST request; however, this time the payload will be our news stories. If you haven't set up the Maker Channel, do this now. Instructions can be found in Chapter 3, Build an App to Find Cheap Airfares. You should also set up the Gmail channel. Once that is complete, we'll add a recipe to combine the two.

First, click on Create a Recipe from the IFTTT home page. Then, search for the Maker Channel:

Image from https://www.iftt.com

Select this, then select Receive a web request:

Image from https://www.iftt.com

Then, give the request a name. I'm using news_event:

Image from https://www.iftt.com

Finish by clicking on Create Trigger. Next, we'll set up the e-mail piece. Search for Gmail and click on the icon seen as follows:

Image from https://www.iftt.com

Once you have clicked on Gmail, click on Send an e-mail. From here, you can customize your e-mail message.

Image from https://www.iftt.com

Input your e-mail address, a subject line, and finally, include Value1 in the e-mail body. We will pass our story title and link into this with our POST request. Click on Create Recipe to finalize this. Now, we're ready to generate the script that will run on a schedule, automatically sending us articles of interest.
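One gap to be aware of: the text-vectorization and model-training step that produces the vect and model objects used below is covered in the book but not included in this excerpt. A minimal sketch of that step, assuming the same TfidfVectorizer and LinearSVC that the final script imports (the exact vectorizer settings here are illustrative, not the book's), might look like this:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

# Turn each article's plain text into TF-IDF features.
vect = TfidfVectorizer(ngram_range=(1, 3), stop_words='english', min_df=3)
X = vect.fit_transform(df['text'])

# Train a linear SVM to predict the 'y'/'n' label we assigned in Pocket.
model = LinearSVC()
model.fit(X, df['wanted'])

The important part is that the fitted vect and model objects are what get serialized in the next step.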
We're going to create a separate script for this, but one last thing that we need to do in our existing code is serialize our vectorizer and our model:

import pickle

pickle.dump(model, open(r'/Users/alexcombs/Downloads/news_model_pickle.p', 'wb'))
pickle.dump(vect, open(r'/Users/alexcombs/Downloads/news_vect_pickle.p', 'wb'))

With this, we have saved everything that we need from our model. In our new script, we will read these in to generate our new predictions. We're going to use the same scheduling library to run the code that we used in Chapter 3, Build an App to Find Cheap Airfares. Putting it all together, we have the following script:

# get our imports
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC
import schedule
import time
import pickle
import json
import gspread
import requests
from bs4 import BeautifulSoup
from oauth2client.client import SignedJwtAssertionCredentials

# create our fetching function
def fetch_news():
    try:
        # load the serialized vectorizer and model
        vect = pickle.load(open(r'/Users/alexcombs/Downloads/news_vect_pickle.p', 'rb'))
        model = pickle.load(open(r'/Users/alexcombs/Downloads/news_model_pickle.p', 'rb'))

        # authorize against Google Sheets and pull down the stored stories
        json_key = json.load(open(r'/Users/alexcombs/Downloads/APIKEY.json'))
        scope = ['https://spreadsheets.google.com/feeds']
        credentials = SignedJwtAssertionCredentials(json_key['client_email'],
                                                    json_key['private_key'].encode(),
                                                    scope)
        gc = gspread.authorize(credentials)
        ws = gc.open("NewStories")
        sh = ws.sheet1
        zd = list(zip(sh.col_values(2), sh.col_values(3), sh.col_values(4)))
        zf = pd.DataFrame(zd, columns=['title', 'urls', 'html'])
        zf.replace('', pd.np.nan, inplace=True)
        zf.dropna(inplace=True)

        def get_text(x):
            soup = BeautifulSoup(x, 'lxml')
            text = soup.get_text()
            return text

        zf.loc[:, 'text'] = zf['html'].map(get_text)

        # score the stories with our trained model
        tv = vect.transform(zf['text'])
        res = model.predict(tv)
        rf = pd.DataFrame(res, columns=['wanted'])
        rez = pd.merge(rf, zf, left_index=True, right_index=True)

        # build the e-mail body from the stories predicted as 'y' and POST it to IFTTT
        news_str = ''
        for t, u in zip(rez[rez['wanted'] == 'y']['title'],
                        rez[rez['wanted'] == 'y']['urls']):
            news_str = news_str + t + '\n' + u + '\n'
        payload = {"value1": news_str}
        r = requests.post('https://maker.ifttt.com/trigger/news_event/with/key/IFTTT_KEY',
                          data=payload)

        # cleanup worksheet
        lenv = len(sh.col_values(1))
        cell_list = sh.range('A1:F' + str(lenv))
        for cell in cell_list:
            cell.value = ""
        sh.update_cells(cell_list)
        print(r.text)
    except:
        print('Failed')

schedule.every(480).minutes.do(fetch_news)

while 1:
    schedule.run_pending()
    time.sleep(1)

This script will run every 480 minutes (8 hours), pull down the news stories from Google Sheets, run the stories through the model, generate an e-mail by sending a POST request to IFTTT for the stories that are predicted to be of interest, and then, finally, clear out the stories in the spreadsheet so that only new stories get sent in the next e-mail.

Congratulations! You now have your own personalized news feed!

In this tutorial, we learned how to create a custom news feed. To know more about setting it up and other intuitive Python projects, check out Python Machine Learning Blueprints: Intuitive data projects you can relate to.

Writing web services with functional Python programming [Tutorial]
Visualizing data in R and Python using Anaconda [Tutorial]
Python 3.7 beta is available as the second generation Google App Engine standard runtime

How to make machine learning based recommendations using Julia [Tutorial]

Prasad Ramesh
08 Feb 2019
8 min read
In this article, we will look at machine learning based recommendations using Julia. We will make recommendations using a Julia package called 'Recommendation'. This article is an excerpt from a book written by Adrian Salceanu titled Julia Programming Projects. In this book, you will learn how to build simple-to-advanced applications through examples in Julia Lang 1.x using modern tools.

In order to ensure that your code will produce the same results as described in this article, it is recommended to use the same package versions. Here are the external packages used in this tutorial and their specific versions:

CSV@v.0.4.3
DataFrames@v0.15.2
Gadfly@v1.0.1
IJulia@v1.14.1
Recommendation@v0.1.0+

In order to install a specific version of a package you need to run:

pkg> add PackageName@vX.Y.Z

For example:

pkg> add IJulia@v1.14.1

Alternatively, you can install all the used packages by downloading the Project.toml file provided on GitHub. You can use pkg> instantiate as follows:

julia> download("https://raw.githubusercontent.com/PacktPublishing/Julia-Projects/master/Chapter07/Project.toml", "Project.toml")
pkg> activate .
pkg> instantiate

Julia's ecosystem provides access to Recommendation.jl, a package that implements a multitude of algorithms for both personalized and non-personalized recommendations. For model-based recommenders, it has support for SVD, MF, and content-based recommendations using TF-IDF scoring algorithms.

There's also another very good alternative: the ScikitLearn.jl package (https://github.com/cstjean/ScikitLearn.jl). This implements Python's very popular scikit-learn interface and algorithms in Julia, supporting both models from the Julia ecosystem and those of the scikit-learn library (via PyCall.jl). The Scikit website and documentation can be found at http://scikit-learn.org/stable/. It is very powerful and definitely worth keeping in mind, especially for building highly efficient recommenders for production usage. For learning purposes, we'll stick to Recommendation, as it provides for a simpler implementation.

Making recommendations with Recommendation

For our learning example, we'll use Recommendation. It is the simplest of the available options, and it's a good teaching device, as it will allow us to further experiment with its plug-and-play algorithms and configurable model generators. Before we can do anything interesting, though, we need to make sure that we have the package installed:

pkg> add Recommendation#master
julia> using Recommendation

Please note that I'm using the #master version, because the tagged version, at the time of writing this book, was not yet fully updated for Julia 1.0.

The workflow for setting up a recommender with Recommendation involves three steps:

Setting up the training data
Instantiating and training a recommender using one of the available algorithms
Once the training is complete, asking for recommendations

Let's implement these steps.

Setting up the training data

Recommendation uses a DataAccessor object to set up the training data. This can be instantiated with a set of Event objects. A Recommendation.Event is an object that represents a user-item interaction. It is defined like this:

struct Event
    user::Int
    item::Int
    value::Float64
end

In our case, the user field will represent the UserID, the item field will map to the ISBN, and the value field will store the Rating. However, a bit more work is needed to bring our data into the format required by Recommendation:

First of all, our ISBN data is stored as a string and not as an integer.
Second, internally, Recommendation builds a sparse matrix of user * item and stores the corresponding values, setting up the matrix using sequential IDs. However, our actual user IDs are large numbers, and Recommendation will set up a very large, sparse matrix, going all the way from the minimum to the maximum user IDs.

What this means is that, for example, we only have 69 users in our dataset (as confirmed by unique(training_data[:UserID]) |> size), with the largest ID being 277,427, while for books we have 9,055 unique ISBNs. If we go with this, Recommendation will create a 277,427 x 9,055 matrix instead of a 69 x 9,055 matrix. This matrix would be very large, sparse, and inefficient.

Therefore, we'll need to do a bit more data processing to map the original user IDs and the ISBNs to sequential integer IDs, starting from 1. We'll use two Dict objects that will store the mappings from the UserID and ISBN columns to the recommender's sequential user and book IDs. Each entry will be of the form dict[original_id] = sequential_id:

julia> user_mappings, book_mappings = Dict{Int,Int}(), Dict{String,Int}()

We'll also need two counters to keep track of, and increment, the sequential IDs:

julia> user_counter, book_counter = 0, 0

We can now prepare the Event objects for our training data:

julia> events = Event[]

julia> for row in eachrow(training_data)
           global user_counter, book_counter
           user_id, book_id, rating = row[:UserID], row[:ISBN], row[:Rating]
           haskey(user_mappings, user_id) || (user_mappings[user_id] = (user_counter += 1))
           haskey(book_mappings, book_id) || (book_mappings[book_id] = (book_counter += 1))
           push!(events, Event(user_mappings[user_id], book_mappings[book_id], rating))
       end

This will fill up the events array with instances of Recommendation.Event, which represents a unique UserID, ISBN, and Rating combination. To give you an idea, it will look like this:

julia> events
10005-element Array{Event,1}:
 Event(1, 1, 10.0)
 Event(1, 2, 8.0)
 Event(1, 3, 9.0)
 Event(1, 4, 8.0)
 Event(1, 5, 8.0)
# output omitted #

Please remember this very important aspect: in Julia, the for loop defines a new scope. This means that variables defined outside the for loop are not accessible inside it. To make them visible within the loop's body, we need to declare them as global.

Now, we are ready to set up our DataAccessor:

julia> da = DataAccessor(events, user_counter, book_counter)

Building and training the recommender

At this point, we have all that we need to instantiate our recommender. A very efficient and common implementation uses MF; unsurprisingly, this is one of the options provided by the Recommendation package, so we'll use it.

Matrix Factorization

The idea behind MF is that, if we're starting with a large sparse matrix like the one used to represent user x profile ratings, then we can represent it as the product of multiple smaller and denser matrices. The challenge is to find these smaller matrices so that their product is as close to our original matrix as possible. Once we have these, we can fill in the blanks in the original matrix so that the predicted values will be consistent with the existing ratings in the matrix. Our user x books rating matrix can be represented as the product between smaller and denser users and books matrices.

To perform the matrix factorization, we can use a couple of algorithms, among which the most popular are SVD and Stochastic Gradient Descent (SGD). Recommendation uses SGD to perform matrix factorization.
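Before running the code, it may help to see the generic formulation that this describes (a standard sketch of SGD-based matrix factorization, not a description of Recommendation's exact internals). We look for a users x k matrix U and a books x k matrix V, for some small k, such that the rating matrix R is approximately U * V'. Writing u_u and v_i for the rows of U and V, SGD sweeps over the known ratings and repeatedly applies updates of the form:

e_ui = r_ui - u_u . v_i
u_u <- u_u + learning_rate * (e_ui * v_i - reg * u_u)
v_i <- v_i + learning_rate * (e_ui * u_u - reg * v_i)

Here learning_rate controls the step size, reg is a regularization term that some implementations expose and others fix internally, and each sweep over the data is one iteration. This is exactly what the max_iter and learning_rate arguments discussed below control.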
The code for this looks as follows:

julia> recommender = MF(da)
julia> build(recommender)

We instantiate a new MF recommender and then we build it—that is, train it. The build step might take a while (a few minutes on a high-end computer using the small dataset that's provided on GitHub).

If we want to tweak the training process, since SGD implements an iterative approach for matrix factorization, we can pass a max_iter argument to the build function, asking it for a maximum number of iterations. The more iterations we do, in theory, the better the recommendations, but the longer it will take to train the model. If you want to speed things up, you can invoke the build function with a max_iter of 30 or less: build(recommender, max_iter = 30).

We can pass another optional argument for the learning rate, for example, build(recommender, learning_rate=15e-4, max_iter=100). The learning rate specifies how aggressively the optimization technique should vary between each iteration. If the learning rate is too small, the optimization will need to be run a lot of times. If it's too big, then the optimization might fail, generating worse results than the previous iterations.

Making recommendations

Now that we have successfully built and trained our model, we can ask it for recommendations. These are provided by the recommend function, which takes an instance of a recommender, a user ID (from the ones available in the training matrix), the number of recommendations, and an array of book IDs from which to make recommendations as its arguments:

julia> recommend(recommender, 1, 20, [1:book_counter...])

With this line of code, we retrieve the recommendations for the user with the recommender ID 1, which corresponds to the UserID 277427 in the original dataset. We're asking for up to 20 recommendations picked from all the available books. We get back an array of Pairs of book IDs and recommendation scores:

20-element Array{Pair{Int64,Float64},1}:
 5081 => 19.1974
 5079 => 19.1948
 5078 => 19.1946
 5077 => 17.1253
 5080 => 17.1246
# output omitted #

In this article, we learned how to make recommendations with machine learning in Julia. To learn more about machine learning recommendations in Julia and testing the model, check out the book Julia Programming Projects.

YouTube to reduce recommendations of 'conspiracy theory' videos that misinform users in the US
How to Build a music recommendation system with PageRank Algorithm
How to build a cold-start friendly content-based recommender using Apache Spark SQL

Managing Heroku from the Command Line

Packt
20 Nov 2014
27 min read
In this article by Mike Coutermarsh, author of Heroku Cookbook, we will cover the following topics:

Viewing application logs
Searching logs
Installing add-ons
Managing environment variables
Enabling the maintenance page
Managing releases and rolling back
Running one-off tasks and dynos
Managing SSH keys
Sharing and collaboration
Monitoring load average and memory usage

Heroku was built to be managed from its command-line interface. The better we learn it, the faster and more effective we will be in administering our application. The goal of this article is to get comfortable with using the CLI. We'll see that each Heroku command follows a common pattern. Once we learn a few of these commands, the rest will be relatively simple to master.

In this article, we won't cover every command available in the CLI, but we will focus on the ones that we'll be using the most. As we learn each command, we will also learn a little more about what is happening behind the scenes so that we get a better understanding of how Heroku works. The more we understand, the more we'll be able to take advantage of the platform.

Before we start, let's note that if we ever need to get a list of the available commands, we can run the following command:

$ heroku help

We can also quickly display the documentation for a single command:

$ heroku help command_name

Viewing application logs

Logging gets a little more complex for any application that is running multiple servers and several different types of processes. Having visibility into everything that is happening within our application is critical to maintaining it. Heroku handles this by combining and sending all of our logs to one place, the Logplex. The Logplex provides us with a single location to view a stream of our logs across our entire application. In this recipe, we'll learn how to view logs via the CLI and how to quickly get visibility into what's happening within our application.

How to do it…

To start, let's open up a terminal, navigate to an existing Heroku application, and perform the following steps:

First, to view our application's logs, we can use the logs command:

$ heroku logs
2014-03-31T23:35:51.195150+00:00 app[web.1]: Rendered pages/about.html.slim within layouts/application (25.0ms)
2014-03-31T23:35:51.215591+00:00 app[web.1]: Rendered layouts/_navigation_links.html.erb (2.6ms)
2014-03-31T23:35:51.230010+00:00 app[web.1]: Rendered layouts/_messages.html.slim (13.0ms)
2014-03-31T23:35:51.215967+00:00 app[web.1]: Rendered layouts/_navigation.html.slim (10.3ms)
2014-03-31T23:35:51.231104+00:00 app[web.1]: Completed 200 OK in 109ms (Views: 65.4ms | ActiveRecord: 0.0ms)
2014-03-31T23:35:51.242960+00:00 heroku[router]: at=info method=GET path=

Heroku logs anything that our application sends to STDOUT or STDERR. If we're not seeing logs, it's very likely that our application is not configured correctly.

We can also watch our logs in real time. This is known as tailing:

$ heroku logs --tail

Instead of --tail, we can also use -t. We'll need to press Ctrl + C to end the command and stop tailing the logs.

If we want to see the 100 most recent lines, we can use -n:

$ heroku logs -n 100

The Logplex stores a maximum of 1500 lines. To view more lines, we'll have to set up log storage.

We can filter the logs to only show a specific process type. Here, we will only see logs from our web dynos:

$ heroku logs -p web

If we want, we can be as granular as showing the logs from an individual dyno.
This will show only the logs from the second web dyno:

$ heroku logs -p web.2

We can use this for any process type; we can try it for our workers if we'd like:

$ heroku logs -p worker

The Logplex contains more than just logs from our application. We can also view logs generated by Heroku or the API. Let's try changing the source to heroku to only see the logs generated by Heroku. This will only show us logs related to the router and resource usage:

$ heroku logs --source heroku

To view logs for only our application, we can set the source to app:

$ heroku logs --source app

We can also view logs from the API. These logs will show any administrative actions we've taken, such as scaling dynos or changing configuration variables. This can be useful when multiple developers are working on an application:

$ heroku logs --source api

We can even combine the different flags. Let's try tailing the logs for only our web dynos:

$ heroku logs -p web --tail

That's it! Remember that if we ever need more information on how to view logs via the CLI, we can always use the help command:

$ heroku help logs

How it works

Under the covers, the Heroku CLI simply passes our request to Heroku's API and then uses Ruby to parse and display our logs. If you're interested in exactly how it works, the code is open source on GitHub at https://github.com/heroku/heroku/blob/master/lib/heroku/command/logs.rb.

Viewing logs via the CLI is most useful in situations where we need to see exactly what our application is doing right now. We'll find that we use it a lot around deploys and when debugging issues. Since the Logplex has a limit of 1500 lines, it's not meant for viewing historical data. For this, we'll need to set up log drains and enable a logging add-on.

Searching logs

Heroku does not have the built-in capability to search our logs from the command line. We can get around this limitation easily by making use of some other command-line tools. In this recipe, we will learn how to combine Heroku's logs with grep, a command-line tool for searching text. This will allow us to search our recent logs for keywords, helping us track down errors more quickly.

Getting ready

For this recipe, we'll need to have grep installed. For OS X and Linux machines, it should already be installed. We can check for and install grep using the following steps:

To check if we have grep installed, let's open up a terminal and type the following:

$ grep
usage: grep [-abcDEFGHhIiJLlmnOoPqRSsUVvwxZ] [-A num] [-B num] [-C[num]]
       [-e pattern] [-f file] [--binary-files=value] [--color=when]
       [--context[=num]] [--directories=action] [--label] [--line-buffered]
       [--null] [pattern] [file ...]

If we do not see usage instructions, we can visit http://www.gnu.org/software/grep/ for the download and installation instructions.

How to do it…

Let's start searching our logs by opening a terminal and navigating to one of our Heroku applications using the following steps:

To search for a keyword in our logs, we need to pipe our logs into grep. This simply means that we will be passing our logs into grep and having grep search them for us. Let's try this now. The following command will search the output of heroku logs for the word error:

$ heroku logs | grep error

Sometimes, we might want to search for a longer string that includes special characters. We can do this by surrounding it with quotes:

$ heroku logs | grep "path=/pages/about host"

It can be useful to also see the lines surrounding the line that matched our search. We can do this as well.
The next command will show us the line that contains an error as well as the three lines above and below it:

$ heroku logs | grep error -C 3

We can even search with regular expressions. The next command will show us every line that matches a number that ends with MB. So, for example, lines with 100 MB, 25 MB, or 3 MB will all appear:

$ heroku logs | grep '\d*MB'

To learn more about regular expressions, visit http://regex.learncodethehardway.org/.

How it works…

Like most Unix-based tools, grep was built to accomplish a single task and to do it well. Global regular expression print (grep) is built to search a set of files for a pattern and then print all of the matches. Grep can also search anything it receives through standard input; this is exactly how we used it in this recipe. By piping the output of our Heroku logs into grep, we are passing our logs to grep as standard input.

See also

To learn more about grep, visit http://www.tutorialspoint.com/unix_commands/grep.htm

Installing add-ons

Our application needs some additional functionality provided by an outside service. What should we do? In the past, this would have involved creating accounts, managing credentials, and maybe even bringing up servers and installing software. This whole process has been simplified by the Heroku add-on marketplace. For any additional functionality that our application needs, our first stop should always be Heroku add-ons.

Heroku has made attaching additional resources to our application a plug-and-play process. If we need an additional database, caching, or error logging, they can be set up with a single command. In this recipe, we will learn the ins and outs of using the Heroku CLI to install and manage our application's add-ons.

How to do it...

To begin, let's open a terminal and navigate to one of our Heroku applications using the following steps:

Let's start by taking a look at all of the available Heroku add-ons. We can do this with the addons:list command:

$ heroku addons:list

There are so many add-ons that viewing them through the CLI is pretty difficult. For easier navigation and search, we should take a look at https://addons.heroku.com/.

If we want to see the currently installed add-ons for our application, we can simply type the following:

$ heroku addons
=== load-tester-rails Configured Add-ons
heroku-postgresql:dev       HEROKU_POSTGRESQL_MAROON
heroku-postgresql:hobby-dev HEROKU_POSTGRESQL_ONYX
librato:development
newrelic:stark

Remember that for any command, we can always add --app app_name to specify the application. Alternatively, our application's add-ons are also listed through the Heroku Dashboard, available at https://dashboard.heroku.com.

The installation of a new add-on is done with addons:add. Here, we are going to install the error logging service, Rollbar:

$ heroku addons:add rollbar
Adding rollbar on load-tester-rails... done, v22 (free)
Use `heroku addons:docs rollbar` to view documentation.

We can quickly open up the documentation for an add-on with addons:docs:

$ heroku addons:docs rollbar

Removing an add-on is just as simple. We'll need to type our application name to confirm. For this example, our application is called load-tester-rails:

$ heroku addons:remove rollbar
!   WARNING: Destructive Action
!   This command will affect the app: load-tester-rails
!   To proceed, type "load-tester-rails" or re-run this command with --confirm load-tester-rails
> load-tester-rails
Removing rollbar on load-tester-rails...
done, v23 (free)

Each add-on comes with different tiers of service. Let's try upgrading our rollbar add-on to the starter tier:

$ heroku addons:upgrade rollbar:starter
Upgrading to rollbar:starter on load-tester-rails... done, v26 ($12/mo)
Plan changed to starter
Use `heroku addons:docs rollbar` to view documentation.

Now, if we want, we can downgrade back to its original level with addons:downgrade:

$ heroku addons:downgrade rollbar
Downgrading to rollbar on load-tester-rails... done, v27 (free)
Plan changed to free
Use `heroku addons:docs rollbar` to view documentation.

If we ever forget any of the commands, we can always use help to quickly see the documentation:

$ heroku help addons

Some add-ons might charge you money. Before continuing, let's double-check that we only have the correct ones enabled, using the $ heroku addons command.

How it works…

Heroku has created a standardized process for all add-on providers to follow. This ensures a consistent experience when provisioning any add-on for our application. It starts when we request the creation of an add-on. Heroku sends an HTTP request to the provider, asking them to provision an instance of their service. The provider must then respond to Heroku with the connection details for their service in the form of environment variables. For example, if we were to provision Redis To Go, we will get back our connection details in a REDISTOGO_URL variable:

REDISTOGO_URL: redis://user:pass@server.redistogo.com:9652

Heroku adds these variables to our application and restarts it. On restart, the variables are available for our application, and we can connect to the service using them. The specifics on how to connect using the variables will be in the add-on's documentation. Installation will depend on the specific language or framework we're using.
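As an illustration of what that looks like in application code, a Python app might read the variable like the following sketch. This assumes the redis-py package and the REDISTOGO_URL example above; any other language follows the same pattern of reading the environment variable that the add-on provides.

import os
import redis  # redis-py, assumed to be listed in our requirements

# Heroku injects REDISTOGO_URL into the environment when the add-on is provisioned.
redis_url = os.environ.get('REDISTOGO_URL', 'redis://localhost:6379')
conn = redis.from_url(redis_url)

conn.set('greeting', 'hello from heroku')
print(conn.get('greeting'))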
See also

For details on creating our own add-ons, the process is well documented on Heroku's website at https://addons.heroku.com/provider
Check out Kensa, the CLI to create Heroku add-ons, at https://github.com/heroku/kensa

Managing environment variables

Our applications will often need access to various credentials in the form of API tokens, usernames, and passwords for integrations with third-party services. We can store this information in our Git repository, but then anyone with access to our code will also have a copy of our production credentials. We should instead use environment variables to store any configuration information for our application. Configuration information should be separate from our application's code and instead be tied to the specific deployment of the application.

Changing our application to use environment variables is simple. Let's look at an example in Ruby; let's assume that we currently have secret_api_token defined in our application's code:

secret_api_token = '123abc'

We can remove the token and replace it with an environment variable:

secret_api_token = ENV['SECRET_TOKEN']

In addition to protecting our credentials, using environment variables makes our application more configurable. We'll be able to quickly make configuration changes without having to change code and redeploy.

The terms "configuration variable" and "environment variable" are interchangeable. Heroku usually uses "configuration" due to how tightly the variables are coupled with the state of the application.

How to do it...

Heroku makes it easy to set our application's environment variables through the config command. Let's launch a terminal and navigate to an existing Heroku project to try it out, using the following steps:

We can use the config command to see a list of all our existing environment variables:

$ heroku config

To view only the value of a specific variable, we can use get:

$ heroku config:get DATABASE_URL

To set a new variable, we can use set:

$ heroku config:set VAR_NAME=var_value
Setting config vars and restarting load-tester-rails... done, v28
VAR_NAME: var_value

Each time we set a config variable, Heroku will restart our application. We can set multiple values at once to avoid multiple restarts:

$ heroku config:set SECRET=value SECRET2=value
Setting config vars and restarting load-tester-rails... done, v29
SECRET: value
SECRET2: value

To delete a variable, we use unset:

$ heroku config:unset SECRET
Unsetting SECRET and restarting load-tester-rails... done, v30

If we want, we can delete multiple variables with a single command:

$ heroku config:unset VAR_NAME SECRET2
Unsetting VAR_NAME and restarting load-tester-rails... done, v31
Unsetting SECRET2 and restarting load-tester-rails... done, v32

Heroku tracks each configuration change as a release. This makes it easy for us to roll back changes if we make a mistake.

How it works…

Environment variables are used on Unix-based operating systems to manage and share configuration information between applications. As they are so common, changing our application to use them does not lock us into deploying only to Heroku.

Heroku stores all of our configuration variables in one central location. Each change to these variables is tracked, and we can view the history by looking through our past releases. When Heroku spins up a new dyno, part of the process is taking all of our configuration settings and setting them as environment variables on the dyno. This is why, whenever we make a configuration change, Heroku restarts our dynos. As configuration variables are such a key part of our Heroku application, any change to them will also be included in our Heroku logs.

See also

Read about the Twelve-Factor app's rule on configuration at http://12factor.net/config

Enabling the maintenance page

Occasionally, we will need to make changes to our application that require downtime. The proper way to do this is to put up a maintenance page that displays a friendly message and responds to all incoming HTTP requests with a 503 Service Unavailable status. Doing this will keep our users informed and also avoid any negative SEO effects. Search engines understand that when they receive a 503 response, they should come back later to recrawl the site. If we didn't use a maintenance page and our application returned 404 or 500 errors instead, it's possible that a search engine crawler might remove the page from its index.

How to do it...

Let's open up a terminal and navigate to one of our Heroku projects to begin, using the following steps:

We can check whether our application's maintenance page is currently enabled with the maintenance command:

$ heroku maintenance
off

Let's try turning it on. This will stop traffic from being routed to our dynos and show the maintenance page as follows:

$ heroku maintenance:on
Enabling maintenance mode for load-tester-rails... done

Now, if we visit our application, we'll see the default Heroku maintenance page.

To disable the maintenance page and resume sending users to our application, we can use the maintenance:off command:

$ heroku maintenance:off
Disabling maintenance mode for load-tester-rails...
done

Managing releases and rolling back

What do we do if disaster strikes and our newly released code breaks our application? Luckily for us, Heroku keeps a copy of every deploy and configuration change to our application. This enables us to roll back to a previous version while we work to correct the errors in our latest release.

Heads up! Rolling back only affects application code and configuration variables. Add-ons and our database will not be affected by a rollback.

In this recipe, we will learn how to manage our releases and roll back code from the CLI.

How to do it...

In this recipe, we'll view and manage our releases from the Heroku CLI, using the releases command. Let's open up a terminal now and navigate to one of our Heroku projects by performing the following steps:

Heroku tracks every deploy and configuration change as a release. We can view all of our releases from both the CLI and the web dashboard with the releases command:

$ heroku releases
=== load-tester-rails Releases
v33 Add WEB_CON config vars    coutermarsh.mike@gmail.com 2014/03/30 11:18:49 (~ 5h ago)
v32 Remove SEC config vars     coutermarsh.mike@gmail.com 2014/03/29 19:38:06 (~ 21h ago)
v31 Remove VAR config vars     coutermarsh.mike@gmail.com 2014/03/29 19:38:05 (~ 21h ago)
v30 Remove config vars         coutermarsh.mike@gmail.com 2014/03/29 19:27:05 (~ 21h ago)
v29 Deploy 9218c1c             coutermarsh.mike@gmail.com 2014/03/29 19:24:29 (~ 21h ago)

Alternatively, we can view our releases through the Heroku dashboard. Visit https://dashboard.heroku.com, select one of our applications, and click on Activity.

We can view detailed information about each release using the info command. This shows us everything about the change and state of the application during this release:

$ heroku releases:info v33
=== Release v33
Addons: librato:development
        newrelic:stark
        rollbar:free
        sendgrid:starter
By:     coutermarsh.mike@gmail.com
Change: Add WEB_CONCURRENCY config vars
When:   2014/03/30 11:18:49 (~ 6h ago)
=== v33 Config Vars
WEB_CONCURRENCY: 3

We can revert to the previous version of our application with the rollback command:

$ heroku rollback
Rolling back load-tester-rails... done, v32
!   Warning: rollback affects code and config vars; it doesn't add or remove addons. To undo, run: heroku rollback v33

Rolling back creates a new version of our application in the release history. We can also specify a specific version to roll back to:

$ heroku rollback v30
Rolling back load-tester-rails... done, v30

The version we roll back to does not have to be an older version. Although it sounds contradictory, we can also roll back to newer versions of our application.

How it works…

Behind the scenes, each Heroku release is tied to a specific slug and set of configuration variables. As Heroku keeps a copy of each slug that we deploy, we're able to quickly roll back to previous versions of our code without having to rebuild our application.

Each deploy release created will include a reference to the Git SHA that was pushed to master. The Git SHA is a reference to the last commit made to our repository before it was deployed. This is useful if we want to know exactly what code was pushed out in that release. On our local machine, we can run the $ git checkout git-sha-here command to view our application's code in the exact state it was in when deployed.

Running one-off tasks and dynos

In more traditional hosting environments, developers will often log in to servers to perform basic administrative tasks or debug an issue.
With Heroku, we can do this by launching one-off dynos. These are dynos that contain our application code but do not serve web requests. For a Ruby on Rails application, one-off dynos are often used to run database migrations or launch a Rails console.

How to do it...

In this recipe, we will learn how to execute commands on our Heroku applications with the heroku run command. Let's launch a terminal now to get started with the following steps:

To have Heroku start a one-off dyno and execute any single command, we will use heroku run. Here, we can try it out by running a simple command to print some text to the screen:

$ heroku run echo "hello heroku"
Running `echo "hello heroku"` attached to terminal... up, run.7702
"hello heroku"

One-off dynos are automatically shut down after the command has finished running.

We can see that Heroku is running this command on a dyno with our application's code. Let's run ls to see a listing of the files on the dyno. They should look familiar:

$ heroku run ls
Running `ls` attached to terminal... up, run.5518
app bin config config.ru db Gemfile Gemfile.lock lib log Procfile public Rakefile README README.md tmp

If we want to run multiple commands, we can start up a bash session. Type exit to close the session:

$ heroku run bash
Running `bash` attached to terminal... up, run.2331
~ $ ls
app bin config config.ru db Gemfile Gemfile.lock lib log Procfile public Rakefile README README.md tmp
~ $ echo "hello"
hello
~ $ exit
exit

We can run tasks in the background using the detached mode. The output of the command goes to our logs rather than the screen:

$ heroku run:detached echo "hello heroku"
Running `echo hello heroku` detached... up, run.4534
Use `heroku logs -p run.4534` to view the output.

If we need more power, we can adjust the size of the one-off dynos. This command will launch a bash session in a 2X dyno:

$ heroku run --size=2X bash

If we are running one-off dynos in the detached mode, we can view their status and stop them in the same way we would stop any other dyno:

$ heroku ps
=== run: one-off processes
run.5927 (1X): starting 2014/03/29 16:18:59 (~ 6s ago)

$ heroku ps:stop run.5927

How it works…

When we issue the heroku run command, Heroku spins up a new dyno with our latest slug and runs the command. Heroku does not start our application; the only command that runs is the command that we explicitly pass to it.

One-off dynos act a little differently than standard dynos. If we create one in the detached mode, it will run until we stop it manually, or it will shut down automatically after 24 hours. It will not restart like a normal dyno would. If we run bash from a one-off dyno, it will run until we close the connection or reach an hour of inactivity.

Managing SSH keys

Heroku manages access to our application's Git repository with SSH keys. When we first set up the Heroku Toolbelt, we had to upload either a new or an existing public key to Heroku's servers. This key allows us to access our Heroku Git repositories without entering our password each time. If we ever want to deploy our Heroku applications from another computer, we'll either need to have the same key on that computer or provide Heroku with an additional one. It's easy enough to do this via the CLI, which we'll learn in this recipe.

How to do it…

To get started, let's fire up a terminal.
We'll be using the keys command in this recipe by performing the following steps:

First, let's view all of the existing keys in our Heroku account:

$ heroku keys
=== coutermarsh.mike@gmail.com Keys
ssh-rsa AAAAB3NzaC...46hEzt1Q== coutermarsh.mike@gmail.com
ssh-rsa AAAAB3NzaC...6EU7Qr3S/v coutermarsh.mike@gmail.com
ssh-rsa AAAAB3NzaC...bqCJkM4w== coutermarsh.mike@gmail.com

To remove an existing key, we can use keys:remove. To the command, we need to pass a string that matches one of the keys:

$ heroku keys:remove "7Qr3S/v coutermarsh.mike@gmail.com"
Removing 7Qr3S/v coutermarsh.mike@gmail.com SSH key... done

To add our current user's public key, we can use keys:add. This will look on our machine for a public key (~/.ssh/id_rsa.pub) and upload it:

$ heroku keys:add
Found existing public key: /Users/mike/.ssh/id_rsa.pub
Uploading SSH public key /Users/mike/.ssh/id_rsa.pub… done

To create a new SSH key, we can run $ ssh-keygen -t rsa.

If we'd like, we can also specify where the key is located if it is not in the default ~/.ssh/ directory:

$ heroku keys:add /path/to/key.pub

How it works…

SSH keys are the standard method for password-less authentication. There are two parts to each SSH key: a private key, which stays on our machine and should never be shared, and a public key, which we can freely upload and share.

Each key has its purpose. The public key is used to encrypt messages. The private key is used to decrypt messages. When we try to connect to our Git repositories, Heroku's server uses our public key to create an encrypted message that can only be decrypted by our private key. The server then sends the message to our machine; our machine's SSH client decrypts it and sends the response to the server. Sending the correct response successfully authenticates us.

SSH keys are not used for authentication to the Heroku CLI. The CLI uses an authentication token that is stored in our ~/.netrc file.

Sharing and collaboration

We can invite collaborators through both the web dashboard and the CLI. In this recipe, we'll learn how to quickly invite collaborators through the CLI.

How to do it…

To start, let's open a terminal and navigate to the Heroku application that we would like to share, using the following steps:

To see the current users who have access to our application, we can use the sharing command:

$ heroku sharing
=== load-tester-rails Access List
coutermarsh.mike@gmail.com owner
mike@form26.com            collaborator

To invite a collaborator, we can use sharing:add:

$ heroku sharing:add coutermarshmike@gmail.com
Adding coutermarshmike@gmail.com to load-tester-rails as collaborator... done

Heroku will send an e-mail to the user we're inviting, even if they do not already have a Heroku account.

If we'd like to revoke access to our application, we can do so with sharing:remove:

$ heroku sharing:remove coutermarshmike@gmail.com
Removing coutermarshmike@gmail.com from load-tester-rails collaborators... done

How it works…

When we add another collaborator to our Heroku application, they are granted the same abilities as us, except that they cannot manage paid add-ons or delete the application. Otherwise, they have full control to administer the application. If they have an existing Heroku account, their SSH key will be immediately added to the application's Git repository.

See also

Interested in using multiple Heroku accounts on a single machine? Take a look at the heroku-accounts plugin at https://github.com/ddollar/heroku-accounts.
Monitoring load average and memory usage

We can monitor the resource usage of our dynos from the command line using the log-runtime-metrics plugin. This will give us visibility into the CPU and memory usage of our dynos. With this data, we'll be able to tell whether our dynos are correctly sized, detect problems earlier, and decide whether we need to scale our application.

How to do it…

Let's open up a terminal; we'll be completing this recipe with the CLI by performing the following steps:

First, we'll need to install the log-runtime-metrics plugin via the CLI. We can do this easily through heroku labs:

$ heroku labs:enable log-runtime-metrics

Now that the runtime metrics plugin is installed, we'll need to restart our dynos for it to take effect:

$ heroku restart

Now that the plugin is installed and running, our dynos' resource usage will be printed to our logs. Let's view them now:

$ heroku logs
heroku[web.1]: source=web.1 dyno=heroku.21 sample#load_avg_1m=0.00 sample#load_avg_5m=0.00
heroku[web.1]: source=web.1 dyno=heroku.21 sample#memory_total=105.28MB sample#memory_rss=105.28MB sample#memory_cache=0.00MB sample#memory_swap=0.00MB sample#memory_pgpgin=31927pages sample#memory_pgpgout=4975pages

From the logs, we can see that for this application, our load average is 0, and this dyno is using a total of 105 MB of RAM.

How it works…

Now that we have some insight into how our dynos are using resources, we need to learn how to interpret these numbers. Understanding the utilization of our dynos will be key for us if we ever need to diagnose a performance-related issue.

In our logs, we will now see load_avg_1m and load_avg_5m. These are our dynos' load averages over a 1-minute and a 5-minute period. The two timeframes are helpful in determining whether we're experiencing a brief spike in activity or something more sustained. Load average is the amount of total computational work that the CPU has to complete. The 1X and 2X dynos have access to four virtual cores. A load average of four means that the dyno's CPU is fully utilized. Any value above four is a warning sign that the dyno might be overloaded, and response times could begin to suffer. Web applications are typically not CPU-intensive, so seeing low load averages for web dynos is expected. If we start seeing high load averages, we should consider either adding more dynos or using larger dynos to handle the load.

Our memory usage is also shown in the logs. The key value that we want to keep track of is memory_rss, which is the total amount of RAM being utilized by our application. It's best to keep this value no higher than 50 to 70 percent of the total RAM available on the dyno. For a 1X dyno with 512 MB of memory, this would mean keeping our memory usage no greater than 250 to 350 MB. This gives our application room to grow under load and helps us avoid any memory swapping. Seeing values above 70 percent is an indication that we need to either adjust our application's memory usage or scale up.

Memory swap occurs when our dyno runs out of RAM. To compensate, our dyno will begin using its hard drive to store data that would normally be stored in RAM. For any web application, swap should be considered evil; this value should always be zero. If our dyno starts swapping, we can expect it to significantly slow down our application's response times. Seeing any swap is an immediate indication that we must either reduce our application's memory consumption or start scaling.
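Because these metrics arrive as ordinary log lines, the usual command-line tools work on them. For example, to pull out only the metric samples we care about (the patterns here are just examples):

$ heroku logs -t | grep 'sample#load_avg'
$ heroku logs -n 200 | grep 'sample#memory_rss'

The first command tails the logs and watches the load averages as they change; the second searches the last 200 log lines for the resident memory figure.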
See also

Load average and memory usage are particularly useful when performing application load tests.

Summary

In this article, we learned various Heroku commands: viewing application logs, installing add-ons, enabling the maintenance page, running one-off dynos, managing SSH keys, sharing and collaboration, monitoring load average and memory usage, and so on.

Resources for Article: Further resources on this subject: Securing vCloud Using the vCloud Networking and Security App Firewall [article] vCloud Networks [article] Apache CloudStack Architecture [article]
Collaboration Using the GitHub Workflow

Packt
30 Sep 2015
12 min read
In this article by Achilleas Pipinellis, the author of the book GitHub Essentials, we look at how GitHub has come up with a workflow based on the features it provides and the power of Git, which it has named the GitHub workflow (https://guides.github.com/introduction/flow). We will learn how to work with branches and pull requests, the latter being the most powerful feature of GitHub.

Learn about pull requests

The pull request is the number one feature in GitHub that made it what it is today. It was introduced in early 2008 and has been used extensively among projects ever since. While everything else can be pretty much disabled in a project's settings (such as issues and the wiki), pull requests are always enabled.

Why pull requests are a powerful asset to work with

Whether you are working on a personal project where you are the sole contributor or on a big open source one with contributors from all over the globe, working with pull requests will certainly make your life easier. I like to think of pull requests as chunks of commits, and the GitHub UI helps you visualize more clearly what is about to be merged into the default branch or the branch of your choice. Pull requests are reviewable with an enhanced diff view. You can easily revert them with a simple button on GitHub, and they can be tested before merging if a CI service is enabled in the project.

The connection between branches and pull requests

There is a special connection between branches and pull requests. Thanks to this connection, GitHub will automatically show you a button to create a new pull request if you push a new branch to your repository. As we will explore in the following sections, this is tightly coupled to the GitHub workflow, and GitHub uses some special words to describe the from and to branches. As per GitHub's documentation: The base branch is where you think changes should be applied, the head branch is what you would like to be applied. So, in GitHub terms, head is your branch, and base is the branch you would like to merge into.

Create branches directly in a project – the shared repository model

The shared repository model, as GitHub aptly calls it, is when you push new branches directly to the source repository. From there, you can create a new pull request by comparing between branches, as we will see in the following sections. Of course, in order to be able to push to a repository you either have to be the owner or a collaborator; in other words, you must have write access.

Create branches in your fork – the fork and pull model

Forked repositories are related to their parent in a way that GitHub uses in order to compare their branches. The fork and pull model is usually used in projects when one does not have write access but is willing to contribute. After forking a repository, you push a branch to your fork and then create a pull request in the source repository asking its maintainer to merge the changes. This is the common practice for contributing to open source projects hosted on GitHub. You will not have access to their repository, but being open source, you can fork the public repository and work on your own copy.

How to create and submit a pull request

There are quite a few ways to initiate the creation of a pull request, as you will see in the following sections. The most common one is to push a branch to your repository and let GitHub's UI guide you. Let's explore this option first.
Use the Compare & pull request button

Whenever a new branch is pushed to a repository, GitHub shows a quick button to create a pull request. In reality, you are taken to the compare page, as we will explore in the next section, but some values are already filled out for you. Let's create, for example, a new branch named add_gitignore where we will add a .gitignore file with the following contents:

git checkout -b add_gitignore
echo -e '.bundle\n.sass-cache\n.vendor\n_site' > .gitignore
git add .gitignore
git commit -m 'Add .gitignore'
git push origin add_gitignore

Next, head over to your repository's main page and you will notice the Compare & pull request button, as shown in the following screenshot:

From here on, if you hit this button you will be taken to the compare page. Note that I am pushing to my repository following the shared repository model, so here is how GitHub greets me:

What would happen if I used the fork and pull repository model? For this purpose, I created another user to fork my repository and followed the same instructions to add a new branch named add_gitignore with the same changes. From here on, when you push the branch to your fork, the Compare & pull request button appears whether you are on your fork's page or on the parent repository. Here is how it looks if you visit your fork:

The following screenshot will appear if you visit the parent repository:

In the last case (captured in red), you can see which user this branch came from (axil43:add_gitignore). In either case, when using the fork and pull model, hitting the Compare & pull request button will take you to the compare page with slightly different options:

Since you are comparing across forks, there are more details. In particular, you can see the base fork and branch as well as the head fork and branch, which are the ones you are the owner of. GitHub considers the default branch set in your repository to be the one you want to merge into (base) when the Create Pull Request button appears. Before submitting it, let's explore the other two options that you can use to create a pull request. You can jump to the Submit a pull request section if you like.

Use the compare function directly

As mentioned in the previous section, the Compare & pull request button gets you to the compare page with some predefined values. The button appears right after you push a new branch and is there only for a few moments. In this section, we will see how to use the compare function directly in order to create a pull request. You can access the compare function by clicking on the green button next to the branch drop-down list on a repository's main page:

This is pretty powerful, as one can compare across forks or, in the same repository, pretty much everything—branches, tags, single commits, and time ranges. The default page when you land on the compare page is like the following one; you start by comparing your default branch, with GitHub proposing a list of recently created branches to choose from and compare:

In order to have something to compare, the base branch must be older than what you are comparing to. From here, if I choose the add_gitignore branch, GitHub compares it to master and shows the diff along with the message that it is able to be merged into the base branch without any conflicts. Finally, you can create the pull request:

Notice that I am using the compare function while I'm at my own repository.
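Incidentally, the compare page can also be reached directly by URL, without any button. Assuming a repository at github.com/<user>/<repo>, a URL of the following form opens the same base...head comparison — the user, repository, and branch names here are placeholders:

https://github.com/<user>/<repo>/compare/master...add_gitignore

This is handy when the quick button has already disappeared from the repository's main page.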
When comparing in a repository that is a fork of another, the compare function slightly changes and automatically includes more options as we have seen in the previous section. As you may have noticed the Compare & pull request quick button is just a shortcut for using compare manually. If you want to have more fine-grained control on the repositories and the branches compared, use the compare feature directly. Use the GitHub web editor So far, we have seen the two most well-known types of initiating a pull request. There is a third way as well: using entirely the web editor that GitHub provides. This can prove useful for people who are not too familiar with Git and the terminal, and can also be used by more advanced Git users who want to propose a quick change. As always, according to the model you are using (shared repository or fork and pull), the process is a little different. Let's first explore the shared repository model flow using the web editor, which means editing files in a repository that you own. The shared repository model Firstly, make sure you are on the branch that you wish to branch off; then, head over a file you wish to change and press the edit button with the pencil icon: Make the change you want in that file, add a proper commit message, and choose Create a new branch giving the name of the branch you wish to create. By default, the branch name is username-patch-i, where username is your username and i is an increasing integer starting from 1. Consecutive edits on files will create branches such as username-patch-1, username-patch-2, and so on. In our example, I decided to give the branch a name of my own: When ready, press the Propose file change button. From this moment on, the branch is created with the file edits you made. Even if you close the next page, your changes will not be lost. Let's skip the pull request submission for the time being and see how the fork and pull model works. The fork and pull model In the fork and pull model, you fork a repository and submit a pull request from the changes you make in your fork. In the case of using the web editor, there is a caveat. In order to get GitHub automatically recognize that you wish to perform a pull request in the parent repository, you have to start the web editor from the parent repository and not your fork. In the following screenshot, you can see what happens in this case: GitHub informs you that a new branch will be created in your repository (fork) with the new changes in order to submit a pull request. Hitting the Propose file change button will take you to the form to submit the pull request: Contrary to the shared repository model, you can now see the base/head repositories and branches that are compared. Also, notice that the default name for the new branch is patch-i, where i is an increasing integer number. In our case, this was the first branch created that way, so it was named patch-1. If you would like to have the ability to name the branch the way you like, you should follow the shared repository model instructions as explained in preceding section. Following that route, edit the file in your fork where you have write access, add your own branch name, hit the Propose file change button for the branch to be created, and then abort when asked to create the pull request. You can then use the Compare & pull request quick button or use the compare function directly to propose a pull request to the parent repository. 
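If you prefer to drive the whole fork and pull flow from the terminal rather than the web editor, it might look roughly like this — the repository URLs and the branch name are placeholders:

# Clone your fork and track the original project as "upstream"
git clone git@github.com:<you>/<repo>.git
cd <repo>
git remote add upstream https://github.com/<owner>/<repo>.git

# Branch off an up-to-date master, make your changes, and push to your fork
git fetch upstream
git checkout -b my_fix upstream/master
git add .
git commit -m 'Fix something small'
git push origin my_fix

After the push, the Compare & pull request quick button (or the compare page) lets you open the pull request against the parent repository.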
One last thing to consider when using the web editor, is the limitation of editing one file at a time. If you wish to include more changes in the same branch that GitHub created for you when you first edited a file, you must first change to that branch and then make any subsequent changes. How to change the branch? Simply choose it from the drop-down menu as shown in the following screenshot: Submit a pull request So far, we have explored the various ways to initiate a pull request. In this section, we will finally continue to submit it as well. The pull request form is identical to the form when creating a new issue. If you have write access to the repository that you are making the pull request to, then you are able to set labels, milestone, and assignee. The title of the pull request is automatically filled by the last commit message that the branch has, or if there are multiple commits, it will just fill in the branch name. In either case, you can change it to your liking. In the following image, you can see the title is taken from the branch name after GitHub has stripped the special characters. In a sense, the title gets humanized: You can add an optional description and images if you deem proper. Whenever ready, hit the Create pull request button. In the following sections, we will explore how the peer review works. Peer review and inline comments The nice thing about pull requests is that you have a nice and clear view of what is about to get merged. You can see only the changes that matter, and the best part is that you can fire up a discussion concerning those changes. In the previous section, we submitted the pull request so that it can be reviewed and eventually get merged. Suppose that we are collaborating with a team and they chime in to discuss the changes. Let's first check the layout of a pull request. Summary In this article, we explored the GitHub workflow and the various ways to perform a pull request, as well as the many features GitHub provides to make that workflow even smoother. This is how the majority of open source projects work when there are dozens of contributors involved. Resources for Article: Further resources on this subject: Git Teaches – Great Tools Don't Make Great Craftsmen[article] Maintaining Your GitLab Instance[article] Configuration [article]
Multithreading in Rust using Crates [Tutorial]

Aaron Lazar
15 Aug 2018
17 min read
The crates.io ecosystem in Rust can make use of approaches to improve our development speed as well as the performance of our code. In this tutorial, we'll learn how to use the crates ecosystem to manipulate threads in Rust. This article is an extract from Rust High Performance, authored by Iban Eguia Moraza. Using non-blocking data structures One of the issues we saw earlier was that if we wanted to share something more complex than an integer or a Boolean between threads and if we wanted to mutate it, we needed to use a Mutex. This is not entirely true, since one crate, Crossbeam, allows us to use great data structures that do not require locking a Mutex. They are therefore much faster and more efficient. Often, when we want to share information between threads, it's usually a list of tasks that we want to work on cooperatively. Other times, we want to create information in multiple threads and add it to a list of information. It's therefore not so usual for multiple threads to be working with exactly the same variables since as we have seen, that requires synchronization and it will be slow. This is where Crossbeam shows all its potential. Crossbeam gives us some multithreaded queues and stacks, where we can insert data and consume data from different threads. We can, in fact, have some threads doing an initial processing of the data and others performing a second phase of the processing. Let's see how we can use these features. First, add crossbeam to the dependencies of the crate in the Cargo.toml file. Then, we start with a simple example: extern crate crossbeam; use std::thread; use std::sync::Arc; use crossbeam::sync::MsQueue; fn main() { let queue = Arc::new(MsQueue::new()); let handles: Vec<_> = (1..6) .map(|_| { let t_queue = queue.clone(); thread::spawn(move || { for _ in 0..1_000_000 { t_queue.push(10); } }) }) .collect(); for handle in handles { handle.join().unwrap(); } let final_queue = Arc::try_unwrap(queue).unwrap(); let mut sum = 0; while let Some(i) = final_queue.try_pop() { sum += i; } println!("Final sum: {}", sum); } Let's first understand what this example does. It will iterate 1,000,000 times in 5 different threads, and each time it will push a 10 to a queue. Queues are FIFO lists, first input, first output. This means that the first number entered will be the first one to pop() and the last one will be the last to do so. In this case, all of them are a 10, so it doesn't matter. Once the threads finish populating the queue, we iterate over it and we add all the numbers. A simple computation should make you able to guess that if everything goes perfectly, the final number should be 50,000,000. If you run it, that will be the result, and that's not all. If you run it by executing cargo run --release, it will run blazingly fast. On my computer, it took about one second to complete. If you want, try to implement this code with the standard library Mutex and vector, and you will see that the performance difference is amazing. As you can see, we still needed to use an Arc to control the multiple references to the queue. This is needed because the queue itself cannot be duplicated and shared, it has no reference count. Crossbeam not only gives us FIFO queues. We also have LIFO stacks. LIFO comes from last input, first output, and it means that the last element you inserted in the stack will be the first one to pop(). 
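Before comparing the queue and the stack, here is a rough sketch of the Mutex-and-vector version mentioned above, in case you want to benchmark it against the MsQueue example yourself. This is not from the book; it is only a minimal equivalent for comparison:

use std::sync::{Arc, Mutex};
use std::thread;

fn main() {
    let queue = Arc::new(Mutex::new(Vec::new()));

    let handles: Vec<_> = (1..6)
        .map(|_| {
            let t_queue = queue.clone();
            thread::spawn(move || {
                for _ in 0..1_000_000 {
                    // Every push has to acquire the lock, which is exactly
                    // the overhead the lock-free MsQueue avoids.
                    t_queue.lock().unwrap().push(10u64);
                }
            })
        })
        .collect();

    for handle in handles {
        handle.join().unwrap();
    }

    let final_queue = Arc::try_unwrap(queue).unwrap().into_inner().unwrap();
    let sum: u64 = final_queue.iter().sum();
    println!("Final sum: {}", sum);
}

Running both versions with cargo run --release should make the difference plain: the result is the same 50,000,000, but this version spends much of its time waiting on the lock.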
Let's see the difference with a couple of threads: extern crate crossbeam; use std::thread; use std::sync::Arc; use std::time::Duration; use crossbeam::sync::{MsQueue, TreiberStack}; fn main() { let queue = Arc::new(MsQueue::new()); let stack = Arc::new(TreiberStack::new()); let in_queue = queue.clone(); let in_stack = stack.clone(); let in_handle = thread::spawn(move || { for i in 0..5 { in_queue.push(i); in_stack.push(i); println!("Pushed :D"); thread::sleep(Duration::from_millis(50)); } }); let mut final_queue = Vec::new(); let mut final_stack = Vec::new(); let mut last_q_failed = 0; let mut last_s_failed = 0; loop { // Get the queue match queue.try_pop() { Some(i) => { final_queue.push(i); last_q_failed = 0; println!("Something in the queue! :)"); } None => { println!("Nothing in the queue :("); last_q_failed += 1; } } // Get the stack match stack.try_pop() { Some(i) => { final_stack.push(i); last_s_failed = 0; println!("Something in the stack! :)"); } None => { println!("Nothing in the stack :("); last_s_failed += 1; } } // Check if we finished if last_q_failed > 1 && last_s_failed > 1 { break; } else if last_q_failed > 0 || last_s_failed > 0 { thread::sleep(Duration::from_millis(100)); } } in_handle.join().unwrap(); println!("Queue: {:?}", final_queue); println!("Stack: {:?}", final_stack); } As you can see in the code, we have two shared variables: a queue and a stack. The secondary thread will push new values to each of them, in the same order, from 0 to 4. Then, the main thread will try to get them back. It will loop indefinitely and use the try_pop() method. The pop() method can be used, but it will block the thread if the queue or the stack is empty. This will happen in any case once all values get popped since no new values are being added, so the try_pop() method will help not to block the main thread and end gracefully. The way it checks whether all the values were popped is by counting how many times it failed to pop a new value. Every time it fails, it will wait for 100 milliseconds, while the push thread only waits for 50 milliseconds between pushes. This means that if it tries to pop new values two times and there are no new values, the pusher thread has already finished. It will add values as they are popped to two vectors and then print the result. In the meantime, it will print messages about pushing and popping new values. You will understand this better by seeing the output: Note that the output can be different in your case, since threads don't need to be executed in any particular order. In this example output, as you can see, it first tries to get something from the queue and the stack but there is nothing there, so it sleeps. The second thread then starts pushing things, two numbers actually. After this, the queue and the stack will be [0, 1]. Then, it pops the first item from each of them. From the queue, it will pop the 0 and from the stack it will pop the 1 (the last one), leaving the queue as [1] and the stack as [0]. It will go back to sleep and the secondary thread will insert a 2 in each variable, leaving the queue as [1, 2] and the stack as [0, 2]. Then, the main thread will pop two elements from each of them. From the queue, it will pop the 1 and the 2, while from the stack it will pop the 2 and then the 0, leaving both empty. The main thread then goes to sleep, and for the next two tries, the secondary thread will push one element and the main thread will pop it, twice. 
It might seem a little bit complex, but the idea is that these queues and stacks can be used efficiently between threads without requiring a Mutex, and they accept any Send type. This means that they are great for complex computations, and even for multi-staged complex computations. The Crossbeam crate also has some helpers to deal with epochs and even some variants of the mentioned types. For multithreading, Crossbeam also adds a great utility: scoped threads. Scoped threads In all our examples, we have used standard library threads. As we have discussed, these threads have their own stack, so if we want to use variables that we created in the main thread we will need to send them to the thread. This means that we will need to use things such as Arc to share non-mutable data. Not only that, having their own stack means that they will also consume more memory and eventually make the system slower if they use too much. Crossbeam gives us some special threads that allow sharing stacks between them. They are called scoped threads. Using them is pretty simple and the crate documentation explains them perfectly; you will just need to create a Scope by calling crossbeam::scope(). You will need to pass a closure that receives the Scope. You can then call spawn() in that scope the same way you would do it in std::thread, but with one difference, you can share immutable variables among threads if they were created inside the scope or moved to it. This means that for the queues or stacks we just talked about, or for atomic data, you can simply call their methods without requiring an Arc! This will improve the performance even further. Let's see how it works with a simple example: extern crate crossbeam; fn main() { let all_nums: Vec<_> = (0..1_000_u64).into_iter().collect(); let mut results = Vec::new(); crossbeam::scope(|scope| { for num in &all_nums { results.push(scope.spawn(move || num * num + num * 5 + 250)); } }); let final_result: u64 = results.into_iter().map(|res| res.join()).sum(); println!("Final result: {}", final_result); } Let's see what this code does. It will first just create a vector with all the numbers from 0 to 1000. Then, for each of them, in a crossbeam scope, it will run one scoped thread per number and perform a supposedly complex computation. This is just an example, since it will just return a result of a simple second-order function. Interestingly enough, though, the scope.spawn() method allows returning a result of any type, which is great in our case. The code will add each result to a vector. This won't directly add the resulting number, since it will be executed in parallel. It will add a result guard, which we will be able to check outside the scope. Then, after all the threads run and return the results, the scope will end. We can now check all the results, which are guaranteed to be ready for us. For each of them, we just need to call join() and we will get the result. Then, we sum it up to check that they are actual results from the computation. This join() method can also be called inside the scope and get the results, but it will mean that if you do it inside the for loop, for example, you will block the loop until the result is generated, which is not efficient. The best thing is to at least run all the computations first and then start checking the results. If you want to perform more computations after them, you might find it useful to run the new computation in another loop or iterator inside the crossbeam scope. 
But, how does crossbeam allow you to use the variables outside the scope freely? Won't there be data races? Here is where the magic happens. The scope will join all the inner threads before exiting, which means that no further code will be executed in the main thread until all the scoped threads finish. This means that we can use the variables of the main thread, also called parent stack, due to the main thread being the parent of the scope in this case without any issue. We can actually check what is happening by using the println!() macro. If we remember from previous examples, printing to the console after spawning some threads would usually run even before the spawned threads, due to the time it takes to set them up. In this case, since we have crossbeam preventing it, we won't see it. Let's check the example: extern crate crossbeam; fn main() { let all_nums: Vec<_> = (0..10).into_iter().collect(); crossbeam::scope(|scope| { for num in all_nums { scope.spawn(move || { println!("Next number is {}", num); }); } }); println!("Main thread continues :)"); } If you run this code, you will see something similar to the following output: As you can see, scoped threads will run without any particular order. In this case, it will first run the 1, then the 0, then the 2, and so on. Your output will probably be different. The interesting thing, though, is that the main thread won't continue executing until all the threads have finished. Therefore, reading and modifying variables in the main thread is perfectly safe. There are two main performance advantages with this approach; Arc will require a call to malloc() to allocate memory in the heap, which will take time if it's a big structure and the memory is a bit full. Interestingly enough, that data is already in our stack, so if possible, we should try to avoid duplicating it in the heap. Moreover, the Arc will have a reference counter, as we saw. And it will even be an atomic reference counter, which means that every time we clone the reference, we will need to atomically increment the count. This takes time, even more than incrementing simple integers. Most of the time, we might be waiting for some expensive computations to run, and it would be great if they just gave all the results when finished. We can still add some more chained computations, using scoped threads, that will only be executed after the first ones finish, so we should use scoped threads more often than normal threads, if possible. Using thread pool So far, we have seen multiple ways of creating new threads and sharing information between them. Nevertheless, the ideal number of threads we should spawn to do all the work should be around the number of virtual processors in the system. This means we should not spawn one thread for each chunk of work. Nevertheless, controlling what work each thread does can be complex, since you have to make sure that all threads have work to do at any given point in time. Here is where thread pooling comes in handy. The Threadpool crate will enable you to iterate over all your work and for each of your small chunks, you can call something similar to a thread::spawn(). The interesting thing is that each task will be assigned to an idle thread, and no new thread will be created for each task. The number of threads is configurable and you can get the number of CPUs with other crates. Not only that, if one of the threads panics, it will automatically add a new one to the pool. 
To see an example, first, let's add threadpool and num_cpus as dependencies in our Cargo.toml file. Then, let's look at some example code:

extern crate num_cpus;
extern crate threadpool;

use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::Arc;
use threadpool::ThreadPool;

fn main() {
    let pool = ThreadPool::with_name("my worker".to_owned(), num_cpus::get());
    println!("Pool threads: {}", pool.max_count());

    let result = Arc::new(AtomicUsize::new(0));

    for i in 0..1_000_000 {
        let t_result = result.clone();
        pool.execute(move || {
            t_result.fetch_add(i, Ordering::Relaxed);
        });
    }

    pool.join();

    let final_res = Arc::try_unwrap(result).unwrap().into_inner();
    println!("Final result: {}", final_res);
}

This code will create a thread pool with as many threads as your computer has logical CPUs. Then, it will add each number from 0 to 1,000,000 to an atomic usize, just to test parallel processing. Each addition will be performed by one thread. Doing this with one thread per operation (1,000,000 threads) would be really inefficient. In this case, though, it will use the appropriate number of threads, and the execution will be really fast.

There is another crate that gives thread pools an even more interesting parallel processing feature: Rayon.

Using parallel iterators

If you can see the big picture in these code examples, you'll have realized that most of the parallel work has a long loop, giving work to different threads. It happened with simple threads and it happens even more with scoped threads and thread pools. It's usually the case in real life, too. You might have a bunch of data to process, and you can probably separate that processing into chunks, iterate over them, and hand them over to various threads to do the work for you.

The main issue with that approach is that if you need to use multiple stages to process a given piece of data, you might end up with lots of boilerplate code that can make it difficult to maintain. Not only that, you might find yourself not using parallel processing sometimes due to the hassle of having to write all that code.

Luckily, Rayon has multiple data parallelism primitives around iterators that you can use to parallelize any iterative computation. You can almost forget about the Iterator trait and use Rayon's ParallelIterator alternative, which is as easy to use as the standard library trait!

Rayon uses a parallel iteration technique called work stealing. For each iteration of the parallel iterator, the new value or values get added to a queue of pending work. Then, when a thread finishes its work, it checks whether there is any pending work to do and, if there is, it starts processing it. This, in most languages, is a clear source of data races, but thanks to Rust, this is no longer an issue, and your algorithms can run extremely fast and in parallel.

Let's look at how to use it for an example similar to those we have seen in this chapter. First, add rayon to your Cargo.toml file and then let's start with the code:

extern crate rayon;

use rayon::prelude::*;

fn main() {
    let result = (0..1_000_000_u64)
        .into_par_iter()
        .map(|e| e * 2)
        .sum::<u64>();

    println!("Result: {}", result);
}

As you can see, this works just as you would write it with a sequential iterator, yet it's running in parallel.
Of course, running this example sequentially will be faster than running it in parallel thanks to compiler optimizations, but when you need to process data from files, for example, or perform very complex mathematical computations, parallelizing the input can give great performance gains. Rayon implements these parallel iteration traits to all standard library iterators and ranges. Not only that, it can also work with standard library collections, such as HashMap and Vec. In most cases, if you are using the iter() or into_iter() methods from the standard library in your code, you can simply use par_iter() or into_par_iter() in those calls and your code should now be parallel and work perfectly. But, beware, sometimes parallelizing something doesn't automatically improve its performance. Take into account that if you need to update some shared information between the threads, they will need to synchronize somehow, and you will lose performance. Therefore, multithreading is only great if workloads are completely independent and you can execute one without any dependency on the rest. If you found this article useful and would like to learn more such tips, head over to pick up this book, Rust High Performance, authored by Iban Eguia Moraza. Rust 1.28 is here with global allocators, nonZero types and more Java Multithreading: How to synchronize threads to implement critical sections and avoid race conditions Multithreading with Qt
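Before leaving Rayon, here is one more minimal sketch showing the iter()-to-par_iter() swap on an existing collection; the data and the closure are arbitrary examples, not taken from the book:

extern crate rayon;

use rayon::prelude::*;

fn main() {
    let words = vec!["parallel", "iterators", "borrow", "the", "collection"];
    // Same shape as words.iter().map(...).collect(), but the items are
    // processed across Rayon's thread pool.
    let lengths: Vec<usize> = words.par_iter().map(|w| w.len()).collect();
    println!("{:?}", lengths);
}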
Eric Evans at Domain-Driven Design Europe 2019 explains the different bounded context types and their relation with microservices

Bhagyashree R
17 Dec 2019
9 min read
The fourth edition of the Domain-Driven Design Europe 2019 conference was held early this year from Jan 31-Feb 1 at Amsterdam. Eric Evans, who is known for his book Domain-Driven Design: Tackling Complexity in Software kick-started the conference with a great talk titled "Language in Context". In his keynote, Evans explained some key Domain-driven design concepts including subdomains, context maps, and bounded context. He introduced some new concepts as well including bubble context, quaint context, patch on patch context, and more. He further talked about the relationship between the bounded context and microservices. Want to learn domain-driven design concepts in a practical way? Check out our book, Hands-On Domain-Driven Design with .NET Core by Alexey Zimarev. This book will guide you in involving business stakeholders when choosing the software you are planning to build for them. By figuring out the temporal nature of behavior-driven domain models, you will be able to build leaner, more agile, and modular systems. What is a bounded context? Domain-driven design is a software development approach that focuses on the business domain or the subject area. To solve problems related to that domain, we create domain models which are abstractions describing selected aspects of a domain. The terminology and concepts related to these models only make sense within a context. In domain-driven design, this is called bounded context.  Bounded context is one of the most important concepts in domain-driven design. Evans explained that bounded context is basically a boundary where we eliminate any kind of ambiguity. It is a part of the software where particular terms, definitions, and rules apply in a consistent way. Another important property of the bounded context is that a developer and other people in the team should be able to easily see that “boundary.” They should know whether they are inside or outside of the boundary.  Within this bounded context, we have a canonical context in which we explore different domain models, refine our language and develop ubiquitous language, and try to focus on the core domain. Evans says that though this is a very “tidy” way of creating software, this is not what we see in reality. “Nothing is that tidy! Certainly, none of the large software systems that I have ever been involved with,” he says. He further added that though the concept of bounded context has grabbed the interest of many within the community, it is often “misinterpreted.” Evans has noticed that teams often confuse between bounded context and subdomain. The reason behind this confusion is that in an “ideal” scenario they should coincide. Also, large corporations are known for reorganizations leading to changes in processes and responsibilities. This could result in two teams having to work in the same bounded contexts with an increased risk of ending up with a “big ball of mud.” The different ways of describing bounded contexts In their paper, Big Ball of Mud, Brian Foote and Joseph Yoder describe the big ball of mud as “a haphazardly structured, sprawling, sloppy, duct-tape and baling wire, spaghetti code jungle.” Some of the properties that Evans uses to describe it are incomprehensible interdependencies, inconsistent definitions, incomplete coverage, and risky to change. Needless to say, you would want to avoid the big ball of mud by all accounts. However, if you find yourself in such a situation, Evans says that building the system from the ground up is not an ideal solution. 
Instead, he suggests going for something called bubble context in which you create a new model that works well next to the already existing models. While the business is run by the big ball of mud, you can do an elegant design within that bubble. Another context that Evans explained was the mature productive context. It is the part of the software that is producing value but probably is built on concepts in the core domain that are outdated. He explained this particular context with an example of a garden. A “tidy young garden” that has been recently planted looks great, but you do not get much value from it. It is only a few months later when the plants start fruition and you get the harvest. Along similar lines, developers should plant seeds with the goal of creating order, but also embrace the chaotic abundance that comes with a mature system. Evans coined another term quaint context for a context that one would consider "legacy". He describes it as an old context that still does useful work but is implemented using old fashioned technology or is not aligned with the current domain vision. Another name he suggests is patch on patch context that also does something useful as it is, but its numerous interdependency “makes change risky and expensive.” Apart from these, there are many other types of context that we do not explicitly label. When you are establishing a boundary, it is good practice to analyze different subdomains and check the ones that are generic and ones that are specific to the business. Here he introduced the generic subdomain context. “Generic here means something that everybody does or a great range of businesses and so forth do. There’s nothing special about our business and we want to approach this is a conventional way. And to do that the best way I believe is to have a context, a boundary in which we address that problem,” he explains. Another generic context Evans mentioned was generic off the shelf (OTS), which can make setting the boundary easier as you are getting something off the shelf. Bounded context types in the microservice architecture Evans sees microservices as the biggest opportunity and risks the software engineering community has had in a long time. Looking at the hype around microservices it is tempting to jump on the bandwagon, but Evans suggests that it is important to see the capabilities microservices provide us to meet the needs of the business. A common misconception people have is that microservices are bounded context, which Evans calls oversimplification. He further shared four kinds of context that involve microservices: Service internal The first one is service internal that describes how a service actually works. Evans believes that this is the type of context that people think of when they say microservice is a bounded context. In this context, a service is isolated from other services and handled by an autonomous team. Though this definitely fits the definition of a bounded context, it is not the only aspect of microservices, Evans notes. If we only use this type, we would end up with a bunch of services that don't know how to interact with each other.  API of Service  The API of service context describes how a service talks to other services. In this context as well, an API is built by an autonomous team and anyone consuming their API is required to conform to them. This implies that all the development decisions are pretty much dictated by the data flow direction, however, Evans think there are other alternatives. 
Highly influential groups may create an API that other teams must conform to irrespective of the direction data is flowing. Cluster of codesigned services The cluster of codesigned services context refers to the cluster of services designed in close collaboration. Here, the bounded context consists of a cluster of services designed to work with each other to accomplish some tasks. Evans remarks that the internals of the individual services could be very different from the models used in the API. Interchange context The final type is interchange context. According to Evans, the interaction between services must also be modeled. The model will describe messages and definitions to use when services interact with other services. He further notes that there are no services in this context as it is all about messages, schemas, and protocols. How legacy systems can participate in microservices architecture Coming back to legacy systems and how they can participate in a microservices environment, Evans introduced a new concept called Exposed Legacy Asset. He suggests creating an interface that looks like a microservice and interacts with other microservices, but internally interacts with a legacy system. This will help us avoid corrupting the new microservices built and also keeps us from having to change the legacy system. In the end, looking back at 15 years of his book, Domain-Driven Design, he said that we now may need a new definition of domain-driven design. A challenge that he sees is how tight this definition should be. He believes that a definition should share a common vision and language, but also be flexible enough to encourage innovation and improvement. He doesn’t want the domain-driven design to become a club of happy members. He instead hopes for an intellectually honest community of practitioners who are “open to the possibility of being wrong about things.” If you tried to take the domain-driven design route and you failed at some point, it is important to question and reexamine. Finally, he summarized by defining domain-driven design as a set of guiding principles and heuristics. The key principles are focussing on the core domain, exploring models in a creative collaboration of domain experts and software experts, and speaking a ubiquitous language within a bounded context. [box type="shadow" align="" class="" width=""] “Let's practice DDD together, shake it up and renew,” he concludes. [/box] If you want to put these and other domain-driven design principles into practice, grab a copy of our book, Hands-On Domain-Driven Design with .NET Core by Alexey Zimarev. This book will help you discover and resolve domain complexity together with business stakeholders and avoid common pitfalls when creating the domain model. You will further study the concept of bounded context and aggregate, and much more. Gabriel Baptista on how to build high-performance software architecture systems with C# and .Net Core You can now use WebAssembly from .NET with Wasmtime! Exploring .Net Core 3.0 components with Mark J. Price, a Microsoft specialist
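To make the Exposed Legacy Asset idea slightly more concrete, here is a small, purely illustrative sketch. Evans showed no code in the talk, so the Java language choice, the class names, and the shape of the legacy call are all assumptions: the point is only that a thin facade exposes a microservice-style interface while delegating internally to the old system.

// A hypothetical interchange-model ticket that other services understand.
class Ticket {
    final String id;
    final String status;
    Ticket(String id, String status) { this.id = id; this.status = status; }
}

// Stand-in for the legacy system's awkward, record-oriented API.
class LegacyCrm {
    String fetchRawRecord(String id) { return id + "|OPEN|2019-02-01"; }
}

// The Exposed Legacy Asset: looks like any other service to its callers,
// but translates between the interchange model and the legacy format,
// so the new microservices never see the legacy representation.
class ExposedLegacyTickets {
    private final LegacyCrm legacy = new LegacyCrm();

    Ticket byId(String id) {
        String[] fields = legacy.fetchRawRecord(id).split("\\|");
        return new Ticket(fields[0], fields[1]);
    }

    public static void main(String[] args) {
        Ticket t = new ExposedLegacyTickets().byId("T-42");
        System.out.println(t.id + " -> " + t.status);
    }
}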
15 things every BI professional should know about Tableau

Fatema Patrawala
17 Dec 2019
8 min read
“The art and practice of visualizing data is becoming ever more important in bridging the human-computer gap to mediate analytical insight in a meaningful way.” ―Edd Dumbill

Tableau is a powerful data visualization and discovery tool. It is an important part of a data analyst's or data scientist's skill set, with many organizations specifying it as a key skill in job adverts. In this article, we'll take a look at a few things in Tableau you need to know to successfully make a mark in your business intelligence career.

While the architecture of traditional BI tools has hardware limitations, Tableau has no such dependencies: it can function independently and requires minimal hardware support. Traditional tools are based on a complex set of technologies, whereas Tableau is based on associative search technology, making it intuitive, fast, and dynamic. Tableau supports in-memory, multi-thread, and multi-core computing and other advanced capabilities that traditional BI tools do not offer.

Various Tableau products

Tableau Desktop is a self-service business analytics and data visualization suite that anyone can use. With Tableau Desktop, you can extract massive data offline from your data warehouse for live, up-to-date data analysis.

Tableau Online / Tableau Server is an online hosting platform designed for enterprise users. It lets users working in Tableau publish and share dashboards across organizations and teams.

Tableau Reader is a free desktop application that enables you to open and view visualizations that are built in Tableau Desktop.

Tableau Public is free Tableau software which you can use to make visualizations, but you will need to save your workbook or worksheets to Tableau Public for anyone else to view them.

Different data types in Tableau

All fields in a data source have a data type. The data type reflects the kind of information stored in that field, for example integers (410), dates (1/23/2015) and strings (“Wisconsin”). The data type of a field is identified in the Data pane by an icon. The data types are (source: Tableau website):

- Text (string) values
- Date values
- Date & Time values
- Numerical values
- Boolean values (relational only), for example True/False
- Geographic values (used with maps)
- Cluster Group

Measures and Dimensions in Tableau

Measures contain numeric, quantitative values that you can measure. Measures can be aggregated. When you drag a measure into the view, Tableau applies an aggregation to that measure (by default). Dimensions, on the other hand, contain qualitative values (such as names, dates, or geographical data). You can use dimensions to categorize, segment, and reveal the details in your data. Dimensions affect the level of detail in the view.

Ways to connect data in Tableau

We can either connect live to a data set or extract data into Tableau.

Live: Connecting live to a data set leverages its computational processing and storage. New queries will go to the database and will be reflected as new or updated within the data.

Extract: The Extract API allows you to programmatically extract and combine any data sources for use in Tableau.

There can be multiple data source connections to different sources in the same workbook. Each connection will show up under the Data tab on the left sidebar. The benefit of a Tableau extract over a live connection is that the extract can be used anywhere without a connection, and you can build your own visualizations without connecting to the database.
You can read a complete section on how to extract data in Tableau in the book Learning Tableau 2019 - Third Edition, written by Joshua Milligan. The book takes you from the foundations of the Tableau 2019 paradigm through to advanced topics.

Joins and Blends in Tableau

Joining tables and blending data sources are two different ways to link related data together in Tableau. Joins are performed to link tables of data together on a row-by-row basis. Blends are performed to link together multiple data sources at an aggregate level.

Different filters in Tableau and the use cases in which each is most relevant

In Tableau, filters are used to restrict the data coming from the database. Often, you will want to filter data in Tableau in order to perform an analysis on a subset of data, narrow your focus, or drill into detail. Tableau offers multiple ways to filter data. If you want to limit the scope of your analysis to a subset of data, you can filter the data at the source using one of the following techniques:

Data Source Filters are applied before all other filters and are useful when you want to limit your analysis to a subset of data.

Extract Filters limit the data that is stored in an extract (.tde or .hyper). Data source filters are often converted into extract filters if they are present when you extract the data.

Custom SQL Filters can be accomplished using a live connection with custom SQL, which has a Tableau parameter in the WHERE clause.

Dual axis in Tableau

Dual axis is a useful feature in Tableau that lets users view the scales of two measures in the same graph. Many websites, such as Indeed.com, make use of a dual axis to show the comparison between two measures and their growth rates over a specific set of years. A dual axis lets you compare multiple measures at once, with two independent axes layered on top of one another.

Key components of a Tableau Dashboard

Horizontal – Horizontal layout containers allow the designer to group worksheets and dashboard components left to right across the page and edit the height of all elements at once.

Vertical – Vertical containers allow the user to group worksheets and dashboard components top to bottom down the page and edit the width of all elements at once.

Text – All textual fields.

Image Extract – A Tableau workbook is stored in XML format; in order to extract images, Tableau applies codes within that XML to reference the image.

Web [URL ACTION] – A URL action is a hyperlink that points to a web page, file, or other web-based resource outside of Tableau. You can use URL actions to link to more information about your data that may be hosted outside of your data source. To make the link relevant to your data, you can substitute field values of a selection into the URL as parameters.

If you want to learn how to design dashboards in Tableau, the book Learning Tableau 2019 gives you a step-by-step process for designing dashboards.

Why automate reports in Tableau

Once you have automated reporting, you'll have time to spend on innovative projects. What can be done manually could be performed by automation, delivering the same results in a fraction of the time. Reducing such time-consuming and repetitive tasks will make you more productive and more efficient.

What is a story in Tableau? Why would you create a story, and what is it used for?
A story is a sheet that contains a sequence of worksheets or dashboards that work together to convey information. You can create stories to show how facts are connected, provide context, demonstrate how decisions relate to outcomes, or simply make a compelling case. Each individual sheet in a story is called a story point. The primary objective of creating stories in Tableau is to communicate data to a certain audience with an intended result.  How can you create stories in Tableau? There is a feature in Tableau named as Stories that allows you to tell a story using interactive snapshots of dashboards and views. The snapshots become points in a story. This allows you to construct guided narrative or even an entire presentation. Read this chapter, ‘Telling a Data Story with Dashboards’ from this book, Learning Tableau 2019, to create insightful dashboards in Tableau.    How to embed views into Webpages? You can embed interactive Tableau views and dashboards into web pages, blogs, wiki pages, web applications, and intranet portals. Embedded views update as the underlying data changes, or as their workbooks are updated on Tableau Server. Embedded views follow the same licensing and permission restrictions used on Tableau Server. That is, to see a Tableau view that’s embedded in a web page, the person accessing the view must also have an account on Tableau Server. Alternatively, if your organization uses a core-based license on Tableau Server, a Guest account is available. This allows people in your organization to view and interact with Tableau views embedded in web pages without having to sign in to the server. Contact your server or site administrator to find out if the Guest user is enabled for the site you publish to.  What is Tableau Prep? Can we clean messy data with Tableau? Tableau Prep extends the Tableau platform with robust options for cleaning and structuring data for analysis in Tableau. In the same way that Tableau Desktop provides a hands-on, visual experience for visualizing and analyzing data, Tableau Prep provides a hands-on, visual experience for cleaning and shaping data. If you wish to know more about Tableau Prep or how to clean messy data to create powerful data visualizations and unlock intelligent business insights, read this book Learning Tableau 2019, written by Joshua N. Milligan. ‘Tableau Day’ highlights: Augmented Analytics, Tableau Prep Builder and Conductor, and more! Alteryx vs. Tableau: Choosing the right data analytics tool for your business How to do data storytelling well with Tableau [Video]
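Coming back to the embedding question answered above: a minimal embed using the Tableau JavaScript API might look roughly like the following. The server URL, view path, and element ID are placeholders, and the exact API version available on your server may differ:

<div id="vizContainer"></div>
<script src="https://YOUR-TABLEAU-SERVER/javascripts/api/tableau-2.min.js"></script>
<script>
  // Point the Viz object at a published view; viewers still need permission
  // on Tableau Server (or the Guest account) to actually see it.
  var viz = new tableau.Viz(
    document.getElementById("vizContainer"),
    "https://YOUR-TABLEAU-SERVER/views/Workbook/SheetName",
    { hideTabs: true, width: "800px", height: "600px" }
  );
</script>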
How to integrate a Medium editor in Angular 8

Guest Contributor
05 Sep 2019
5 min read
In the world of text editing, there is a new era of WYSIWYG (What You See Is What You Get). We all know how important styling and formatting are for your website, but most of the time it is tough to pick a simple, easy-to-use, and powerful editor. The good days are coming back with the Medium Editor!

Medium Editor is an independent JavaScript library that creates a floating toolbar which pops up when you select a piece of content on your page, inspired by the elegance of Medium.com. You can turn every field, from a message on your contact form to a whole article on the back-end, into professionally styled text containing quote blocks, headings, and hyperlinks, or just a few selected words. You can also incorporate the editor into Angular 8 for easy updates and edits of your content.

Angular 8 has released its latest feature set, beta 6, with attractive new functionality for testing your software and fixing bugs. One of them is Bazel, Google's open-source part of its internal build tool called Blaze, which is capable of performing incremental builds and tests.

Let us check how you can integrate a Medium editor in the Angular 8 platform.

Also Read: Angular CLI 8.3.0 releases with a new deploy command, faster production builds, and more

Steps to create an editor using Angular 8

Step 1: First things first, create a project with the Angular CLI; you can also make use of Bootstrap to make it look good by adding its CDN links in index.html. Once the command has finished installing all the dependencies, it generates an Angular starter application.

Step 2: Install the medium-editor npm package, and then include its CSS and JS files in the angular.json file.

Step 3: Create a component with a name of your choice (the post uses one named create).

Step 4: Open the newly created component's HTML file and add a div, giving it a template reference name; a few Bootstrap classes will give it some basic styling.

Step 5: In the component class, create an editor variable and use a view child property to get a reference to that div.

Step 6: Next, we will make use of one of Angular's lifecycle hooks, ngAfterViewInit, to instantiate the editor. At this point you may get an error about MediumEditor not being defined; to fix it, declare it at the top of the file. After making these changes, you have a small Medium editor to use for yourself: write anything, then select the text you have written to see the magic.

Step 7: After this, you may want some more options in your editor toolbar. To get them, pass a configuration object to the MediumEditor constructor. With those changes, you will see plenty of additional options.

Step 8: Now that you have an editor, you can easily get the data from it. If someone writes a post, you need the HTML of that post. Divide the screen into two parts: one half holds the editor and the other half shows a preview of the post. In the preview half, bind the editor's content to the element's innerHTML.
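For reference, steps 3 to 8 might come together in a component roughly like the one below. The original post shows its code as screenshots, so everything here — the selector, the toolbar buttons, and the two-pane binding — is an illustrative reconstruction rather than the author's exact code.

// create.component.ts — assumes medium-editor's JS and CSS are listed in angular.json
import { Component, ElementRef, ViewChild, AfterViewInit } from '@angular/core';

// Loaded globally through the scripts array in angular.json (step 2).
declare var MediumEditor: any;

@Component({
  selector: 'app-create',
  templateUrl: './create.component.html',
})
export class CreateComponent implements AfterViewInit {
  @ViewChild('editable', { static: true }) editable: ElementRef;
  editor: any;
  content = '';

  ngAfterViewInit() {
    // Configuration object passed to the MediumEditor constructor (step 7).
    this.editor = new MediumEditor(this.editable.nativeElement, {
      toolbar: {
        buttons: ['bold', 'italic', 'underline', 'anchor', 'h2', 'h3', 'quote'],
      },
    });
    // Keep the preview pane (step 8) in sync with the editor's HTML.
    this.editor.subscribe('editableInput', () => {
      this.content = this.editable.nativeElement.innerHTML;
    });
  }
}

create.component.html — editor on the left, live preview on the right:

<div class="row">
  <div class="col-6">
    <div #editable class="border p-3"></div>
  </div>
  <div class="col-6" [innerHTML]="content"></div>
</div>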
Wrap Up

Every framework has its pros and cons, and Angular is no exception. Angular offers clean code development in a high-performance framework that handles routing, provides seamless updates through its command-line interface, and can retrieve the state of location services. You can also debug templates in Angular 8, and it supports multiple applications in one domain. On the other hand, Angular can be confusing for newcomers, as there is no single accurate manual with complete documentation of the framework; the developer community is comparatively small, debugging options are limited, and routing has its own constraints. Still, Angular 8 supports multiple applications in one domain and is user-friendly across operating system versions.

So, here we come to the end of the article. We hope you have gained an understanding of how to integrate the latest Medium editor into Angular 8. Do give it a try! Till then, keep learning!

Author Bio

Dave Jarvis is working as a Business Development Executive at eTatvaSoft.com, an enterprise-level mobile & web application development company. He aims to sharpen his analytical skills, deepen his data understanding and broaden his business knowledge over the coming years of his career. Click here to find more information about the company. Follow him on Twitter.

Other interesting news in Web development

Google Chrome 76 now supports native lazy-loading

Laravel 6.0 releases with Laravel Vapor compatibility, LazyCollection, improved authorization response and more

#Reactgate forces React leaders to confront the community's toxic culture head on


How to build and deploy Microservices using Payara Micro

Gebin George
28 Mar 2018
9 min read
Payara Micro offers a new way to run Java EE or microservice applications. It is based on the Web profile of GlassFish and bundles a few additional APIs. The distribution is designed with modern containerized environments in mind. Payara Micro is available to download as a standalone executable JAR, as well as a Docker image. It's an open source MicroProfile compatible runtime. Today, we will learn to use Payara Micro to build and deploy microservices.

Here's a list of APIs that are supported in Payara Micro:

Servlets, JSTL, EL, and JSPs
WebSockets
JSF
JAX-RS
EJB lite
JTA
JPA
Bean Validation
CDI
Interceptors
JBatch
Concurrency
JCache

We will be exploring how to build our services using Payara Micro in the next section.

Building services with Payara Micro

Let's start building parts of our Issue Management System (IMS), which is going to be a one-stop destination for collaboration among teams. As the name implies, this system will be used for managing issues that are raised as tickets and get assigned to users for resolution. To begin the project, we will identify our microservice candidates based on the business model of IMS. Here, let's define three functional services, which will be hosted in their own independent Git repositories:

ims-micro-users
ims-micro-tasks
ims-micro-notify

You might wonder, why these three and why separate repositories? We could create much more fine-grained services and perhaps it wouldn't be wrong to do so. The answer lies in understanding the following points:

Isolating what varies: We need to be able to independently develop and deploy each unit. Changes to one business capability or domain shouldn't require changes in other services more often than desired.

Organisation or team structure: If you define teams by business capability, then they can work independently of others and release features with greater agility. The tasks team should be able to evolve independently of the teams that are handling users or notifications. The functional boundaries should allow independent version and release cycle management.

Transactional boundaries for consistency: Distributed transactions are not easy; creating services for related features that are too fine-grained can lead to more complexity than desired. You would need to become familiar with concepts like eventual consistency, but these are not easy to achieve in practice.

Source repository per service: Setting up a single repository that hosts all the services is ideal when it's the same team that works on these services and the project is relatively small. But we are building our fictional IMS, which is a large, complex system with many moving parts. Separate teams would get tightly coupled by sharing a repository. Moreover, versioning and tagging of releases will be yet another problem to solve.

The projects are created as standard Java EE projects, which are Skinny WARs, that will be deployed using the Payara Micro server. Payara Micro allows us to delay the decision of using a Fat JAR or Skinny WAR. This gives us flexibility in picking the deployment choice at a later stage.
As Maven is a widely adopted build tool among developers, we will use it to create our example projects, using the following commands:

mvn archetype:generate -DgroupId=org.jee8ng -DartifactId=ims-micro-users -DarchetypeArtifactId=maven-archetype-webapp -DinteractiveMode=false

mvn archetype:generate -DgroupId=org.jee8ng -DartifactId=ims-micro-tasks -DarchetypeArtifactId=maven-archetype-webapp -DinteractiveMode=false

mvn archetype:generate -DgroupId=org.jee8ng -DartifactId=ims-micro-notify -DarchetypeArtifactId=maven-archetype-webapp -DinteractiveMode=false

Once the structure is generated, update the properties and dependencies section of pom.xml with the following contents, for all three projects:

<properties>
  <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
  <maven.compiler.source>1.8</maven.compiler.source>
  <maven.compiler.target>1.8</maven.compiler.target>
  <failOnMissingWebXml>false</failOnMissingWebXml>
</properties>
<dependencies>
  <dependency>
    <groupId>javax</groupId>
    <artifactId>javaee-api</artifactId>
    <version>8.0</version>
    <scope>provided</scope>
  </dependency>
  <dependency>
    <groupId>junit</groupId>
    <artifactId>junit</artifactId>
    <version>4.12</version>
    <scope>test</scope>
  </dependency>
</dependencies>

Next, create a beans.xml file under the WEB-INF folder for all three projects:

<?xml version="1.0" encoding="UTF-8"?>
<beans xmlns="http://xmlns.jcp.org/xml/ns/javaee"
       xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
       xsi:schemaLocation="http://xmlns.jcp.org/xml/ns/javaee
       http://xmlns.jcp.org/xml/ns/javaee/beans_2_0.xsd"
       bean-discovery-mode="all">
</beans>

You can delete the index.jsp and web.xml files, as we won't be needing them. The project structure of ims-micro-users is used as the reference; the same structure applies to ims-micro-tasks and ims-micro-notify.

The package names for the users, tasks, and notify services will be as follows:

org.jee8ng.ims.users (inside ims-micro-users)
org.jee8ng.ims.tasks (inside ims-micro-tasks)
org.jee8ng.ims.notify (inside ims-micro-notify)

Each of the above will in turn have sub-packages called boundary, control, and entity. The structure follows the Boundary-Control-Entity (BCE)/Entity-Control-Boundary (ECB) pattern. The JaxrsActivator shown as follows is required to enable the JAX-RS API and thus needs to be placed in each of the projects:

import javax.ws.rs.ApplicationPath;
import javax.ws.rs.core.Application;

@ApplicationPath("resources")
public class JaxrsActivator extends Application {}

All three projects will have REST endpoints that we can invoke over HTTP. When doing RESTful API design, a popular convention is to use plural names for resources, especially if the resource could represent a collection. For example:

/users
/tasks

The resource class names in the projects use the plural form, as it's consistent with the resource URL naming used. This avoids confusion such as a resource URL being called a users resource while the class is named UserResource. Given that this is an opinionated approach, feel free to use singular class names if desired. Here's the relevant code for the ims-micro-users, ims-micro-tasks, and ims-micro-notify projects respectively.
Under ims-micro-users, define the UsersResource endpoint:

package org.jee8ng.ims.users.boundary;

import javax.ws.rs.*;
import javax.ws.rs.core.*;

@Path("users")
public class UsersResource {

  @GET
  @Produces(MediaType.APPLICATION_JSON)
  public Response get() {
    return Response.ok("user works").build();
  }
}

Under ims-micro-tasks, define the TasksResource endpoint:

package org.jee8ng.ims.tasks.boundary;

import javax.ws.rs.*;
import javax.ws.rs.core.*;

@Path("tasks")
public class TasksResource {

  @GET
  @Produces(MediaType.APPLICATION_JSON)
  public Response get() {
    return Response.ok("task works").build();
  }
}

Under ims-micro-notify, define the NotificationsResource endpoint:

package org.jee8ng.ims.notify.boundary;

import javax.ws.rs.*;
import javax.ws.rs.core.*;

@Path("notifications")
public class NotificationsResource {

  @GET
  @Produces(MediaType.APPLICATION_JSON)
  public Response get() {
    return Response.ok("notification works").build();
  }
}

Once you build all three projects using mvn clean install, you will get your Skinny WAR files generated in the target directory, which can be deployed on the Payara Micro server.

Running services with Payara Micro

Download the Payara Micro server if you haven't already, from this link: https://www.payara.fish/downloads. The micro server will have the name payara-micro-xxx.jar, where xxx is the version number, which might be different when you download the file. Here's how you can start Payara Micro with our services deployed locally. When doing so, we need to ensure that the instances start on different ports, to avoid any port conflicts:

>java -jar payara-micro-xxx.jar --deploy ims-micro-users/target/ims-micro-users.war --port 8081
>java -jar payara-micro-xxx.jar --deploy ims-micro-tasks/target/ims-micro-tasks.war --port 8082
>java -jar payara-micro-xxx.jar --deploy ims-micro-notify/target/ims-micro-notify.war --port 8083

This will start three instances of Payara Micro running on the specified ports. This makes our applications available under these URLs:

http://localhost:8081/ims-micro-users/resources/users/
http://localhost:8082/ims-micro-tasks/resources/tasks/
http://localhost:8083/ims-micro-notify/resources/notifications/

Payara Micro can be started on a non-default port by using the --port parameter, as we did earlier. This is useful when running multiple instances on the same machine. Another option is to use the --autoBindHttp parameter, which will attempt to bind on 8080 as the default port, and if that port is unavailable, it will try to bind on the next port up, repeating until it finds an available port.

Uber JAR option: There's one more feature that Payara Micro provides. We can generate an Uber JAR as well, which would be the Fat JAR approach that we learnt about in the Fat JAR section. To package our ims-micro-users project as an Uber JAR, we can run the following command:

java -jar payara-micro-xxx.jar --deploy ims-micro-users/target/ims-micro-users.war --outputUberJar users.jar

This will generate the users.jar file in the directory where you run this command. The size of this JAR will naturally be larger than our WAR file, since it also bundles the Payara Micro runtime in it. Here's how you can start the application using the generated JAR:

java -jar users.jar

The server parameters that we used earlier can be passed to this runnable JAR file too. Apart from the two choices we saw for running our microservice projects, there's a third option as well.
Payara Micro provides an API-based approach, which can be used to programmatically start the embedded server. We will expand upon these three services as we progress further into the realm of cloud-based Java EE. We saw how to leverage the power of Payara Micro to run Java EE or microservice applications. You read an excerpt from the book Java EE 8 and Angular, written by Prashant Padmanabhan. This book helps you build high-performing enterprise applications using Java EE powered by Angular at the frontend.

Using the Registry and xlsxwriter modules

Packt
14 Apr 2016
12 min read
In this article by Chapin Bryce and Preston Miller, the authors of Learning Python for Forensics, we will learn about the features offered by the Registry and xlswriter modules. (For more resources related to this topic, see here.) Working with the Registry module The Registry module, developed by Willi Ballenthin, can be used to obtain keys and values from registry hives. Python provides a built-in registry module called _winreg; however, this module only works on Windows machines. The _winreg module interacts with the registry on the system running the module. It does not support opening external registry hives. The Registry module allows us to interact with the supplied registry hives and can be run on non-Windows machines. The Registry module can be downloaded from https://github.com/williballenthin/python-registry. Click on the releases section to see a list of all the stable versions and download the latest version. For this article, we use version 1.1.0. Once the archived file is downloaded and extracted, we can run the included setup.py file to install the module. In a command prompt, execute the following code in the module's top-level directory as shown: python setup.py install This should install the Registry module successfully on your machine. We can confirm this by opening the Python interactive prompt and typing import Registry. We will receive an error if the module is not installed successfully. With the Registry module installed, let's begin to learn how we can leverage this module for our needs. First, we need to import the Registry class from the Registry module. Then, we use the Registry function to open the registry object that we want to query. Next, we use the open() method to navigate to our key of interest. In this case, we are interested in the RecentDocs registry key. This key contains recent active files separated by extension as shown: >>> from Registry import Registry >>> reg = Registry.Registry('NTUSER.DAT') >>> recent_docs = reg.open('SOFTWARE\Microsoft\Windows\CurrentVersion\Explorer\RecentDocs') If we print therecent_docs variable, we can see that it contains 11 values with five subkeys, which may contain additional values and subkeys. Additionally, we can use thetimestamp() method to see the last written time of the registry key. >>> print recent_docs Registry Key CMI-CreateHive{B01E557D-7818-4BA7-9885-E6592398B44E}SoftwareMicrosoftWindowsCurrentVersionExplorerRecentDocs with 11 values and 5 subkeys >>> print recent_docs.timestamp() # Last Written Time 2012-04-23 09:34:12.099998 We can iterate over the values in the recent_docs key using the values() function in a for loop. For each value, we can access the name(), value(), raw_data(), value_type(), and value_type_str() methods. The value() and raw_data() represent the data in different ways. We will use the raw_data() function when we want to work with the underlying binary data and use the value() function to gather an interpreted result. The value_type() and value_type_str() functions display a number or string that identify the type of data, such as REG_BINARY, REG_DWORD, REG_SZ, and so on. >>> for i, value in enumerate(recent_docs.values()): ... print '{}) {}: {}'.format(i, value.name(), value.value()) ... 0) MRUListEx: ???? 1) 0: myDocument.docx 2) 4: oldArchive.zip 3) 2: Salaries.xlsx ... Another useful feature of the Registry module is the means provided for querying for a certain subkey or value. This is provided by the subkey(), value(), or find_key() functions. 
A RegistryKeyNotFoundException is generated when a subkey is not present while using the subkey() function: >>> if recent_docs.subkey('.docx'): ... print 'Found docx subkey.' ... Found docx subkey. >>> if recent_docs.subkey('.1234abcd'): ... print 'Found 1234abcd subkey.' ... Registry.Registry.RegistryKeyNotFoundException: ... The find_key() function takes a path and can find a subkey through multiple levels. The subkey() and value() functions only search child elements. We can use these functions to confirm that a key or value exists before trying to navigate to them. If a particular key or value cannot be found, a custom exception from the Registry module is raised. Be sure to add error handling to catch this error and also alert the user that the key was not discovered. With the Registry module, finding keys and their values becomes straightforward. However, when the values are not strings and are instead binary data we have to rely on another module to make sense of the mess. For all binary needs, the struct module is an excellent candidate. Read also: Tools for Working with Excel and Python Creating Spreadsheets with the xlsxwriter Module Xlsxwriter is a useful third-party module that writes Excel output. There are a plethora of Excel-supported modules for Python, but we chose this module because it was highly robust and well-documented. As the name suggests, this module can only be used to write Excel spreadsheets. The xlsxwriter module supports cell and conditional formatting, charts, tables, filters, and macros among others. Adding data to a spreadsheet Let's quickly create a script called simplexlsx.v1.py for this example. On lines 1 and 2 we import the xlsxwriter and datetime modules. The data we are going to be plotting, including the header column is stored as nested lists in the school_data variable. Each list is a row of information that we want to store in the output excel sheet, with the first element containing the column names. 001 import xlsxwriter 002 from datetime import datetime 003 004 school_data = [['Department', 'Students', 'Cumulative GPA', 'Final Date'], 005 ['Computer Science', 235, 3.44, datetime(2015, 07, 23, 18, 00, 00)], 006 ['Chemistry', 201, 3.26, datetime(2015, 07, 25, 9, 30, 00)], 007 ['Forensics', 99, 3.8, datetime(2015, 07, 23, 9, 30, 00)], 008 ['Astronomy', 115, 3.21, datetime(2015, 07, 19, 15, 30, 00)]] The writeXLSX() function, defined on line 11, is responsible for writing our data in to a spreadsheet. First, we must create our Excel spreadsheet using the Workbook() function supplying the desired name of the file. On line 13, we create a worksheet using the add_worksheet() function. This function can take the desired title of the worksheet or use the default name 'Sheet N', where N is the specific sheet number. 011 def writeXLSX(data): 012 workbook = xlsxwriter.Workbook('MyWorkbook.xlsx') 013 main_sheet = workbook.add_worksheet('MySheet') The date_format variable stores a custom number format that we will use to display our datetime objects in the desired format. On line 17, we begin to enumerate through our data to write. The conditional on line 18 is used to handle the header column which is the first list encountered. We use the write() function and supply a numerical row and column. Alternatively, we can also use the Excel notation, i.e. A1. 
015 date_format = workbook.add_format({'num_format': 'mm/dd/yy hh:mm:ss AM/PM'}) 016 017 for i, entry in enumerate(data): 018 if i == 0: 019 main_sheet.write(i, 0, entry[0]) 020 main_sheet.write(i, 1, entry[1]) 021 main_sheet.write(i, 2, entry[2]) 022 main_sheet.write(i, 3, entry[3]) The write() method will try to write the appropriate type for an object when it can detect the type. However, we can use different write methods to specify the correct format. These specialized writers preserve the data type in Excel so that we can use the appropriate data type specific Excel functions for the object. Since we know the data types within the entry list, we can manually specify when to use the general write() function or the specific write_number() function. 023 else: 024 main_sheet.write(i, 0, entry[0]) 025 main_sheet.write_number(i, 1, entry[1]) 026 main_sheet.write_number(i, 2, entry[2]) For the fourth entry in the list, thedatetime object, we supply the write_datetime() function with our date_format defined on line 15. After our data is written to the workbook, we use the close() function to close and save our data. On line 32, we call the writeXLSX() function passing it to the school_data list we built earlier. 027 main_sheet.write_datetime(i, 3, entry[3], date_format) 028 029 workbook.close() 030 031 032 writeXLSX(school_data) A table of write functions and the objects they preserve is presented below. Function Supported Objects write_string str write_number int, float, long write_datetime datetime objects write_boolean bool write_url str When the script is invoked at the Command Line, a spreadsheet called MyWorkbook.xlsx is created. When we convert this to a table, we can sort it according to any of our values. Had we failed to preserve the data types values such as our dates might be identified as non-number types and prevent us from sorting them appropriately. Building a table Being able to write data to an Excel file and preserve the object type is a step-up over CSV, but we can do better. Often, the first thing an examiner will do with an Excel spreadsheet is convert the data into a table and begin the frenzy of sorting and filtering. We can convert our data range to a table. In fact, writing a table with xlsxwriter is arguably easier than writing each row individually. The following code will be saved into the file simplexlsx.v2.py. For this iteration, we have removed the initial list in the school_data variable that contained the header information. Our new writeXLSX() function writes the header separately. 004 school_data = [['Computer Science', 235, 3.44, datetime(2015, 07, 23, 18, 00, 00)], 005 ['Chemistry', 201, 3.26, datetime(2015, 07, 25, 9, 30, 00)], 006 ['Forensics', 99, 3.8, datetime(2015, 07, 23, 9, 30, 00)], 007 ['Astronomy', 115, 3.21, datetime(2015, 07, 19, 15, 30, 00)]] Lines 10 through 14 are identical to the previous iteration of the function. Representing our table on the spreadsheet is accomplished on line 16. 010 def writeXLSX(data): 011 workbook = xlsxwriter.Workbook('MyWorkbook.xlsx') 012 main_sheet = workbook.add_worksheet('MySheet') 013 014 date_format = workbook.add_format({'num_format': 'mm/dd/yy hh:mm:ss AM/PM'}) The add_table() function takes multiple arguments. First, we pass a string representing the top-left and bottom-right cells of the table in Excel notation. We use the length variable, defined on line 15, to calculate the necessary length of our table. 
The second argument is a little more confusing; this is a dictionary with two keys, named data and columns. The data key has a value of our data variable, which is perhaps poorly named in this case. The columns key defines each row header and, optionally, its format, as seen on line 19:

015 length = str(len(data) + 1)
016 main_sheet.add_table(('A1:D' + length), {'data': data,
017 'columns': [{'header': 'Department'}, {'header': 'Students'},
018 {'header': 'Cumulative GPA'},
019 {'header': 'Final Date', 'format': date_format}]})
020 workbook.close()

In fewer lines than the previous example, we've managed to create a more useful output built as a table. Now our spreadsheet has our specified data already converted into a table and ready to be sorted. There are more possible keys and values that can be supplied during the construction of a table. Please consult the documentation at http://xlsxwriter.readthedocs.org for more details on advanced usage. This process is simple when we are working with nested lists representing each row of a worksheet. Data structures not in the specified format require a combination of both methods demonstrated in our previous iterations to achieve the same effect. For example, we can define a table to span across a certain number of rows and columns and then use the write() function for those cells. However, to prevent unnecessary headaches we recommend keeping data in nested lists.

Creating charts with Python

Lastly, let's create a chart with xlsxwriter. The module supports a variety of different chart types including: line, scatter, bar, column, pie, and area. We use charts to summarize the data in meaningful ways. This is particularly useful when working with large data sets, allowing examiners to gain a high-level understanding of the data before getting into the weeds. Let's modify the previous iteration yet again to display a chart. We will save this modified file as simplexlsx.v3.py. On line 21, we are going to create a variable called department_grades. This variable will be our chart object, created by the add_chart() method. For this method, we pass in a dictionary specifying keys and values. In this case, we specify the type of the chart to be a column chart:

021 department_grades = workbook.add_chart({'type':'column'})

On line 22, we use the set_title() function and again pass it a dictionary of parameters. We set the name key equal to our desired title. At this point, we need to tell the chart what data to plot. We do this with the add_series() function. Each category key maps to the Excel notation specifying the horizontal axis data. The vertical axis is represented by the values key. With the data to plot specified, we use the insert_chart() function to plot the data in the spreadsheet. We give this function a string of the cell at which to place the top-left of the chart, and then the chart object itself:

022 department_grades.set_title({'name':'Department and Grade distribution'})
023 department_grades.add_series({'categories':'=MySheet!$A$2:$A$5', 'values':'=MySheet!$C$2:$C$5'})
024 main_sheet.insert_chart('A8', department_grades)
025 workbook.close()

Running this version of the script will convert our data into a table and generate a column chart comparing departments by their grades. We can clearly see that, unsurprisingly, the Forensic Science department has the highest GPA earners in the school's program. This information is easy enough to eyeball for such a small data set.
However, when working with data orders of larger magnitude, creating summarizing graphics can be particularly useful to understand the big picture. Be aware that there is a great deal of additional functionality in the xlsxwriter module that we will not use in our script. This is an extremely powerful module and we recommend it for any operation that requires writing Excel spreadsheets. Summary In this article, we began with introducing the Registry module and how it is used to obtain keys and values from registry hives. Next, we dealt with various aspects of spreadsheets, such as cells, tables, and charts using the xlswriter module. Resources for Article: Further resources on this subject: Test all the things with Python [article] An Introduction to Python Lists and Dictionaries [article] Python Data Science Up and Running [article]


Build your first Android app with Kotlin

Aarthi Kumaraswamy
13 Apr 2018
10 min read
Android application with Kotlin is an area which shines. Before getting started on this journey, we must set up our systems for the task at hand. A major necessity for developing Android applications is a suitable IDE - it is not a requirement but it makes the development process easier. Many IDE choices exist for Android developers. The most popular are: Android Studio Eclipse IntelliJ IDE Android Studio is by far the most powerful of the IDEs available with respect to Android development. As a consequence, we will be utilizing this IDE in all Android-related chapters in this book. Setting up Android Studio At the time of writing, the version of Android Studio that comes bundled with full Kotlin support is Android Studio 3.0. The canary version of this software can be downloaded from this website. Once downloaded, open the downloaded package or executable and follow the installation instructions. A setup wizard exists to guide you through the IDE setup procedure: Continuing to the next setup screen will prompt you to choose which type of Android Studio setup you'd like: Select the Standard setup and continue to the next screen. Click Finish on the Verify Settings screen. Android Studio will now download the components required for your setup. You will need to wait a few minutes for the required components to download: Click Finish once the component download has completed. You will be taken to the Android Studio landing screen. You are now ready to use Android Studio: [box type="note" align="" class="" width=""]You may also want to read Benefits of using Kotlin Java for Android programming.[/box] Building your first Android application with Kotlin Without further ado, let's explore how to create a simple Android application with Android Studio. We will be building the HelloApp. The HelloApp is an app that displays Hello world! on the screen upon the click of a button. On the Android Studio landing screen, click Start a new Android Studio project. You will be taken to a screen where you will specify some details that concern the app you are about to build, such as the name of the application, your company domain, and the location of the project. Type in HelloApp as the application name and enter a company domain. If you do not have a company domain name, fill in any valid domain name in the company domain input box – as this is a trivial project, a legitimate domain name is not required. Specify the location in which you want to save this project and tick the checkbox for the inclusion of Kotlin support. After filling in the required parameters, continue to the next screen: Here, we are required to specify our target devices. We are building this application to run on smartphones specifically, hence tick the Phone and Tablet checkbox if it's not already ticked. You will notice an options menu next to each device option. This dropdown is used to specify the target API level for the project being created. An API level is an integer that uniquely identifies the framework API division offered by a version of the Android platform. Select API level 15 if not already selected and continue to the next screen: On the next screen, we are required to select an activity to add to our application. An activity is a single screen with a unique user interface—similar to a window. We will discuss activities in more depth in Chapter 2, Building an Android Application – Tetris. For now, select the empty activity and continue to the next screen. 
Now, we need to configure the activity that we just specified should be created. Name the activity HelloActivityand ensure the Generate Layout File and Backwards Compatibility checkboxes are ticked: Now, click the Finish button. Android Studio may take a few minutes to set up your project. Once the setup is complete, you will be greeted by the IDE window containing your project files. [box type="note" align="" class="" width=""]Errors pertaining to the absence of required project components may be encountered at any point during project development. Missing components can be downloaded from the SDK manager. [/box] Make sure that the project window of the IDE is open (on the navigation bar, select View | Tool Windows | Project) and the Android view is currently selected from the drop-down list at the top of the Project window. You will see the following files at the left-hand side of the window: app | java | com.mydomain.helloapp | HelloActivity.java: This is the main activity of your application. An instance of this activity is launched by the system when you build and run your application: app | res | layout | activity_hello.xml: The user interface for HelloActivity is defined within this XML file. It contains a TextView element placed within the ViewGroup of a ConstraintLayout. The text of the TextView has been set to Hello World! app | manifests | AndroidManifest.xml: The AndroidManifest file is used to describe the fundamental characteristics of your application. In addition, this is the file in which your application's components are defined. Gradle Scripts | build.gradle: Two build.gradle files will be present in your project. The first build.gradle file is for the project and the second is for the app module. You will most frequently work with the module's build.gradle file for the configuration of the compilation procedure of Gradle tools and the building of your app. [box type="note" align="" class="" width=""]Gradle is an open source build automation system used for the declaration of project configurations. In Android, Gradle is utilized as a build tool with the goal of building packages and managing application dependencies. [/box] Creating a user interface A user interface (UI) is the primary means by which a user interacts with an application. The user interfaces of Android applications are made by the creation and manipulation of layout files. Layout files are XML files that exist in app | res | layout. To create the layout for the HelloApp, we are going to do three things: Add a LinearLayout to our layout file Place the TextView within the LinearLayout and remove the android:text attribute it possesses Add a button to the LinearLayout Open the activity_hello.xml file if it's not already opened. You will be presented with the layout editor. If the editor is in the Design view, change it to its Text view by toggling the option at the bottom of the layout editor. Now, your layout editor should look similar to that of the following screenshot: ViewGroup that arranges child views in either a horizontal or vertical manner within a single column. 
Copy the code snippet of our required LinearLayout from the following block and paste it within the ConstraintLayout preceding the TextView: <LinearLayout android:id="@+id/ll_component_container" android:layout_width="match_parent" android:layout_height="match_parent" android:orientation="vertical" android:gravity="center"> </LinearLayout> Now, copy and paste the TextView present in the activity_hello.xml file into the body of the LinearLayout element and remove the android:text attribute: <LinearLayout android:id="@+id/ll_component_container" android:layout_width="match_parent" android:layout_height="match_parent" android:orientation="vertical" android:gravity="center"> <TextView android:id="@+id/tv_greeting" android:layout_width="wrap_content" android:layout_height="wrap_content"       android:textSize="50sp" /> </LinearLayout> Lastly, we need to add a button element to our layout file. This element will be a child of our LinearLayout. To create a button, we use the Button element: <LinearLayout android:id="@+id/ll_component_container" android:layout_width="match_parent" android:layout_height="match_parent" android:orientation="vertical" android:gravity="center"> <TextView android:id="@+id/tv_greeting" android:layout_width="wrap_content" android:layout_height="wrap_content"       android:textSize="50sp" /> <Button       android:id="@+id/btn_click_me" android:layout_width="wrap_content" android:layout_height="wrap_content" android:layout_marginTop="16dp" android:text="Click me!"/> </LinearLayout> Toggle to the layout editor's design view to see how the changes we have made thus far translate when rendered on the user interface: Now we have our layout, but there's a problem. Our CLICK ME! button does not actually do anything when clicked. We are going to fix that by adding a listener for click events to the button. Locate and open the HelloActivity.java file and edit the function to add the logic for the CLICK ME! button's click event as well as the required package imports, as shown in the following code: package com.mydomain.helloapp import android.support.v7.app.AppCompatActivity import android.os.Bundle import android.text.TextUtils import android.widget.Button import android.widget.TextView import android.widget.Toast class HelloActivity : AppCompatActivity() { override fun onCreate(savedInstanceState: Bundle?) { super.onCreate(savedInstanceState) setContentView(R.layout.activity_hello) val tvGreeting = findViewById<TextView>(R.id.tv_greeting) val btnClickMe = findViewById<Button>(R.id.btn_click_me) btnClickMe.setOnClickListener { if (TextUtils.isEmpty(tvGreeting.text)) { tvGreeting.text = "Hello World!" } else { Toast.makeText(this, "I have been clicked!",                       Toast.LENGTH_LONG).show() } } } } In the preceding code snippet, we have added references to the TextView and Button elements present in our activity_hello layout file by utilizing the findViewById function. The findViewById function can be used to get references to layout elements that are within the currently-set content view. The second line of the onCreate function has set the content view of HelloActivity to the activity_hello.xml layout. Next to the findViewById function identifier, we have the TextView type written between two angular brackets. This is called a function generic. It is being used to enforce that the resource ID being passed to the findViewById belongs to a TextView element. After adding our reference objects, we set an onClickListener to btnClickMe. 
Listeners are used to listen for the occurrence of events within an application. In order to perform an action upon the click of an element, we pass a lambda containing the action to be performed to the element's setOnClickListener method. When btnClickMe is clicked, tvGreeting is checked to see whether it has been set to contain any text. If no text has been set to the TextView, then its text is set to Hello World!, otherwise a toast is displayed with the I have been clicked! text. Running the Android application In order to run the application, click the Run 'app' (^R) button at the top-right side of the IDE window and select a deployment target. The HelloApp will be built, installed, and launched on the deployment target: You may use one of the available prepackaged virtual devices or create a custom virtual device to use as the deployment target.  You may also decide to connect a physical Android device to your computer via USB and select it as your target. The choice is up to you. After selecting a deployment device, click OK to build and run the application. Upon launching the application, our created layout is rendered: When CLICK ME! is clicked, Hello World! is shown to the user: Subsequent clicks of the CLICK ME! button display a toast message with the text I have been clicked!: You enjoyed an excerpt from the book, Kotlin Programming By Example by Iyanu Adelekan. Start building and deploying Android apps with Kotlin using this book. Check out other related posts: Creating a custom layout implementation for your Android app Top 5 Must-have Android Applications OpenCV and Android: Making Your Apps See      


How to perform Audio-Video-Image Scraping with Python

Amarabha Banerjee
08 Mar 2018
9 min read
[box type="note" align="" class="" width=""]Our article is an excerpt from the book Web Scraping with Python, written by Richard Lawson. This book contains step by step tutorials on how to leverage Python programming techniques for ethical web scraping. [/box] A common practice in scraping is the download, storage, and further processing of media content (non-web pages or data files). This media can include images, audio, and video. To store the content locally (or in a service like S3) and to do it correctly, we need to know what is the type of media, and it isn’t enough to trust the file extension in the URL. Hence, we will learn how to download and correctly represent the media type based on information from the web server. Another common task is the generation of thumbnails of images, videos, or even a page of a website. We will examine several techniques of how to generate thumbnails and make website page screenshots. Many times these are used on a new website as thumbnail links to the scraped media which is stored locally. Finally, it is often the need to be able to transcode media, such as converting non-MP4 videos to MP4, or changing the bit-rate or resolution of a video. Another scenario is to extract only the audio from a video file. We won't look at video transcoding, but we will rip MP3 audio out of an MP4 file using ffmpeg. It's a simple step from there to also transcode video with ffmpeg. Downloading media content from the web Downloading media content from the web is a simple process: use Requests or another library and download it just like you would HTML content. Getting ready There is a class named URLUtility in the urls.py module in the util folder of the solution. This class handles several of the scenarios in this chapter with downloading and parsing URLs. We will be using this class in this recipe and a few others. Make sure the modules folder is in your Python path. Also, the example for this recipe is in the 04/01_download_image.py file. How to do it Here is how we proceed with the recipe: The URLUtility class can download content from a URL. The code in the recipe's file is the following: import const from util.urls import URLUtility util = URLUtility(const.ApodEclipseImage()) print(len(util.data)) When running this you will see the following output:  Reading URL: https://apod.nasa.gov/apod/image/1709/BT5643s.jpg Read 171014 bytes 171014 The example reads 171014 bytes of data. How it works The URL is defined as a constant const.ApodEclipseImage() in the const module: def ApodEclipseImage(): return "https://apod.nasa.gov/apod/image/1709/BT5643s.jpg" The constructor of the URLUtility class has the following implementation: def __init__(self, url, readNow=True): """ Construct the object, parse the URL, and download now if specified""" self._url = url self._response = None self._parsed = urlparse(url) if readNow: self.read() The constructor stores the URL, parses it, and downloads the file with the read() method. The following is the code of the read() method: def read(self): self._response = urllib.request.urlopen(self._url) self._data = self._response.read() This function uses urlopen to get a response object, and then reads the stream and stores it as a property of the object. That data can then be retrieved using the data property: @property def data(self): self.ensure_response() return self._data The code then simply reports on the length of that data, with the value of 171014. 
There's more This class will be used for other tasks such as determining content types, filename, and extensions for those files. We will examine parsing of URLs for filenames next. Parsing a URL with urllib to get the filename When downloading content from a URL, we often want to save it in a file. Often it is good enough to save the file in a file with a name found in the URL. But the URL consists of a number of fragments, so how can we find the actual filename from the URL, especially where there are often many parameters after the file name? Getting ready We will again be using the URLUtility class for this task. The code file for the recipe is 04/02_parse_url.py. How to do it Execute the recipe's file with your python interpreter. It will run the following code: util = URLUtility(const.ApodEclipseImage()) print(util.filename_without_ext) This results in the following output: Reading URL: https://apod.nasa.gov/apod/image/1709/BT5643s.jpg Read 171014 bytes The filename is: BT5643s How it works In the constructor for URLUtility, there is a call to urlib.parse.urlparse. The following demonstrates using the function interactively: >>> parsed = urlparse(const.ApodEclipseImage()) >>> parsed ParseResult(scheme='https', netloc='apod.nasa.gov', path='/apod/image/1709/BT5643s.jpg', params='', query='', fragment='') The ParseResult object contains the various components of the URL. The path element contains the path and the filename. The call to the .filename_without_ext property returns just the filename without the extension: @property def filename_without_ext(self): filename = os.path.splitext(os.path.basename(self._parsed.path))[0] return filename The call to os.path.basename returns only the filename portion of the path (including the extension). os.path.splittext() then separates the filename and the extension, and the function returns the first element of that tuple/list (the filename). There's more It may seem odd that this does not also return the extension as part of the filename. This is because we cannot assume that the content that we received actually matches the implied type from the extension. It is more accurate to determine this using headers returned by the web server. That's our next recipe. Determining the type of content for a URL When performing a GET requests for content from a web server, the web server will return a number of headers, one of which identities the type of the content from the perspective of the web server. In this recipe we learn to use that to determine what the web server considers the type of the content. Getting ready We again use the URLUtility class. The code for the recipe is in 04/03_determine_content_type_from_response.py. How to do it We proceed as follows: Execute the script for the recipe. It contains the following code: util = URLUtility(const.ApodEclipseImage()) print("The content type is: " + util.contenttype) With the following result: Reading URL: https://apod.nasa.gov/apod/image/1709/BT5643s.jpg Read 171014 bytes The content type is: image/jpeg How it works The .contentype property is implemented as follows: @property def contenttype(self): self.ensure_response() return self._response.headers['content-type'] The .headers property of the _response object is a dictionary-like class of headers. The content-type key will retrieve the content-type specified by the server. This call to the ensure_response() method simply ensures that the .read() function has been executed. There's more The headers in a response contain a wealth of information. 
If we look more closely at the headers property of the response, we can see the following headers are returned: >>> response = urllib.request.urlopen(const.ApodEclipseImage()) >>> for header in response.headers: print(header) Date Server Last-Modified ETag Accept-Ranges Content-Length Connection Content-Type Strict-Transport-Security And we can see the values for each of these headers. >>> for header in response.headers: print(header + " ==> " + response.headers[header]) Date ==> Tue, 26 Sep 2017 19:31:41 GMT Server ==> WebServer/1.0 Last-Modified ==> Thu, 31 Aug 2017 20:26:32 GMT ETag ==> "547bb44-29c06-5581275ce2b86" Accept-Ranges ==> bytes Content-Length ==> 171014 Connection ==> close Content-Type ==> image/jpeg Strict-Transport-Security ==> max-age=31536000; includeSubDomains Many of these we will not examine in this book, but for the unfamiliar it is good to know that they exist. Determining the file extension from a content type It is good practice to use the content-type header to determine the type of content, and to determine the extension to use for storing the content as a file. Getting ready We again use the URLUtility object that we created. The recipe's script is 04/04_determine_file_extension_from_contenttype.py):. How to do it Proceed by running the recipe's script. An extension for the media type can be found using the .extension property: util = URLUtility(const.ApodEclipseImage()) print("Filename from content-type: " + util.extension_from_contenttype) print("Filename from url: " + util.extension_from_url) This results in the following output: Reading URL: https://apod.nasa.gov/apod/image/1709/BT5643s.jpg Read 171014 bytes Filename from content-type: .jpg Filename from url: .jpg This reports both the extension determined from the file type, and also from the URL. These can be different, but in this case they are the same. How it works The following is the implementation of the .extension_from_contenttype property: @property def extension_from_contenttype(self): self.ensure_response() map = const.ContentTypeToExtensions() if self.contenttype in map: return map[self.contenttype] return None The first line ensures that we have read the response from the URL. The function then uses a python dictionary, defined in the const module, which contains a dictionary of content types to extension: def ContentTypeToExtensions(): return { "image/jpeg": ".jpg", "image/jpg": ".jpg", "image/png": ".png" } If the content type is in the dictionary, then the corresponding value will be returned. Otherwise, None is returned. Note the corresponding property, .extension_from_url: @property def extension_from_url(self): ext = os.path.splitext(os.path.basename(self._parsed.path))[1] return ext This uses the same technique as the .filename property to parse the URL, but instead returns the [1] element, which represents the extension instead of the base filename. To summarize, we discussed how effectively we can scrap audio, video and image content from the web using Python. If you liked our post, be sure to check out Web Scraping with Python, which gives more information on performing web scraping efficiently with Python.

Transforming Web Data with Browse AI

Merlyn Shelley
26 Mar 2024
14 min read
Subscribe to our Data Pro newsletter for the latest insights. Don't miss out – sign up today!Partnering with Browse AI Turn Web Data into Your Business Superpower!👉 Train a robot in 2 minutes, no coding needed. 🤖 👉 Ideal for web scraping and data monitoring. 🌐 Here’s what you get: Monitor Websites for Changes ✅ Download Data from Any Website ✅ Turn Any Website into an API ✅ Product data extraction ✅ Also, extract data from news, stocks, jobs, social media, and more. Check out this 1-minute explainer video on how to extract data to Excel, Airtable, and connect to 5,000+ apps using Zapier! Start for free with up to 50 credits, and for a limited time, enjoy free setup and onboarding for Team and Company plans, saving up to 20% on Annual plans. Get Scraping Today!👋 Hello,Welcome to DataPro#85 – Your one-stop shop for the latest in Data Science and ML Algorithms! 🚀 In this issue:⚙️ Keeping Up with LLMs & GPTs  Meet Devin: The pioneering AI software engineer. Google's Croissant: A fresh take on metadata for ML-ready datasets. INSTRUCTIR by Kaist AI: Setting new standards in instruction-following for information retrieval models. Spyx by Sussex AI: Turbocharging spiking neural networks with just-in-time compiled optimization. SynCode by VMware: Enhancing LLM code generation with a touch of grammar. Chatbot Arena: The ultimate battleground for evaluating LLMs by human preference. Apollo: Bringing medical AI to the masses with a multilingual medical LLM. ✨ On the RadarTop AI tools for code generation in 2024. Setting up a Pypi mirror in AWS with Terraform. Ensuring safer code changes with custom pre-commit hooks. Deciphering the AQLM Quantization Algorithm. AI's role in revolutionizing web browsing. Tackling tensors through three tricky errors. Running RStudio inside a container. Harnessing PyTorch and MLX for Apple Silicon. 🏭 Industry Highlights Google Research: Boosting LLMs with Cappy, evolving tables with Chain-of-table, and Scalable Instructable Multiworld Agent (SIMA). AWS: Streamlining code review with generative AI using Amazon Bedrock. OpenAI Updates: Leadership continuity and global news partnerships. 📚 New in Packt Library Practical Guide to Applied Conformal Prediction in Python by Valery Manokhin. DataPro Newsletter is not just a publication; it’s a comprehensive toolkit for anyone serious about mastering the ever-changing landscape of data and AI. Grab your copy and start transforming your data expertise today! 📥 Feedback on the Weekly EditionTake our weekly survey and get a free PDF copy of our best-selling book, "Interactive Data Visualization with Python - Second Edition."We appreciate your input and hope you enjoy the book!Share your Feedback!Cheers,Merlyn ShelleyEditor-in-Chief, PacktSign Up | Advertise | Archives🔰 GitHub Finds: Any of These Repos in Your Toolbox?🛠️ deepseek-ai/DeepSeek-VL: Open-source Vision-Language (VL) model for real-world tasks, handling logical diagrams, web pages, formulas, scientific literature, and more. 🛠️ OpenGVLab/VideoMamba: VideoMamba enhances 3D CNNs and video transformers, excelling in long-term video understanding with scalability and modality compatibility. 🛠️ showlab/DragAnything: DragAnything uses entity representation for motion control in video generation, offering user-friendly interaction and outperforming existing methods. 🛠️ pkunlp-icler/FastV: FastV accelerates large vision language models by pruning redundant visual tokens, achieving 45% FLOPs reduction without performance loss. 
🛠️ cnulab/RealNet: RealNet introduces SDAS for anomaly strength control, AFS for feature selection, and RRS for anomaly region identification. Partnering with SurfsharkSurfshark is allowing our readers to enjoy a full 2 years of their award-winning VPN protection for 79% off, plus 2 months free. With Surfshark One, you get: Unlimited devices and connections ✅ One account for the entire household ✅ Your online activity, made safe, secure, and invisible ✅ Plus, identity protection, ad blocking, antivirus, and data breach monitoring.Claim your VPN protection today! 📚 Expert Insights from Packt CommunityPractical Guide to Applied Conformal Prediction in Python - By Valery Manokhin Basic components of a conformal predictor We will now look at the basic components of a conformal predictor: Nonconformity measure: The nonconformity measure is a function that evaluates how much a new data point differs from the existing data points. It compares the new observation to either the entire dataset (in the full transductive version of conformal prediction) or the calibration set (in the most popular variant – ICP. The selection of the nonconformity measure is based on a particular machine learning task, such as classification, regression, or time series forecasting, as well as the underlying model. This will examine several nonconformity measures suitable for classification and regression tasks. Calibration set: The calibration set is a portion of the dataset used to calculate nonconformity scores for the known data points. These scores are a reference for establishing prediction intervals or regions for new test data points. The calibration set should be a representative sample of the entire data distribution and is typically randomly selected. The calibration set should contain a sufficient number of data points (at least 500). If the dataset is small and insufficient to reserve enough data for the calibration set, the user should consider other variants of conformal prediction – including TCP (see, for example, Mastering Classical Transductive Conformal Prediction in Action – https://medium.com/@valeman/how-to-use-full-transductive-conformal-prediction-7ed54dc6b72b). Test set: The test set contains new data points for generating predictions. For every data point in the test set, the conformal prediction model calculates a nonconformity score using the nonconformity measure and compares it to the scores from the calibration set. Using this comparison, the conformal predictor generates a prediction region that includes the target value with a user-defined confidence level. All these components work in tandem to create a conformal prediction framework that facilitates valid and efficient uncertainty quantification in a wide range of machine learning tasks. Discover more insights from 'Practical Guide to Applied Conformal Prediction in Python' by Valery Manokhin. Unlock access to the full book and a wealth of other titles with a 7-day free trial in the Packt Library. Start exploring today!   Read Here!⚡ Tech Tidbits: Stay Wired to the Latest Industry Buzz! AWS ML Made Easy 🌀 Enhance code review and approval efficiency with generative AI using Amazon Bedrock: This post discusses the challenges faced by managers in overseeing code review and approval processes in software development, such as lack of technical expertise, time constraints, volume of change requests, manual effort, and the need for documentation. 
It also introduces a solution that leverages generative artificial intelligence and integrates it with AWS deployment tools to streamline the review and approval process. The solution includes automated change analysis, summarization, and an approval workflow. Google Research 🌀 Cappy: Outperforming and boosting large multi-task language models with a small scorer. This blog discusses advancements in large language models (LLMs) and their use in natural language processing (NLP). It introduces the concept of multi-task LLMs, such as T0, FLAN, and OPT-IML, which excel at understanding and solving various tasks. It also presents a new approach called Cappy, a lightweight pre-trained scorer that enhances the performance and efficiency of multi-task LLMs. 🌀 Chain-of-table: Evolving tables in the reasoning chain for table understanding. This research focuses on improving how large language models (LLMs) reason over tabular data, which is challenging due to the structured nature of tables. The proposed framework, Chain-of-Table, trains LLMs to iteratively update tables, mimicking human reasoning, resulting in improved performance on table understanding tasks. 🌀 Talk like a graph: Encoding graphs for large language models. This research explores how to teach large language models (LLMs) to reason with graph information, crucial for understanding interconnected data. They introduce GraphQA, a benchmark to evaluate LLMs on graph problems, revealing insights into effective graph encoding methods and improving LLM performance on graph tasks by up to 60%. 🌀 Scalable Instructable Multiworld Agent (SIMA): A generalist AI agent for 3D virtual environments. Google DeepMind has developed SIMA, a versatile AI agent trained on multiple video games to follow natural-language instructions, akin to human behavior. Collaborating with game studios, SIMA navigates various environments, showcasing potential for AI to understand and execute diverse tasks. OpenAI Updates 🌀 Review completed & Altman, Brockman to continue to lead OpenAI: The OpenAI Board completed a review by WilmerHale, expressing full confidence in Sam Altman and Greg Brockman's leadership. They also elected new board members and adopted governance enhancements. WilmerHale's review found a breakdown in trust between the prior Board and Mr. Altman, leading to his removal, but concluded that his conduct did not mandate removal. Following the review, the Board endorsed the decision to rehire Mr. Altman and Mr. Brockman. 🌀 Global news partnerships: Le Monde and Prisa Media: OpenAI has partnered with Le Monde and Prisa Media to bring French and Spanish news content to ChatGPT. This partnership aims to enhance user interaction with news content and contribute to the training of OpenAI's models. Through these partnerships, users will access summaries and links to original articles, expanding their news consumption experience. This collaboration supports the news industry and its role in providing reliable information globally. Email Forwarded? Join DataPro Here!🔍 From Bits to BERT: Keeping Up with LLMs & GPTs 🌀 Introducing Devin, the first AI software engineer: Meet Devin, the autonomous AI software engineer, skilled in long-term reasoning and planning. Devin can learn new technologies, build and deploy apps, find and fix bugs, train AI models, and contribute to open source. Devin excels in resolving real-world GitHub issues, outperforming previous models. Cognition, the AI lab behind Devin, aims to unlock new possibilities beyond coding. 
🌀 Google’s Croissant: a metadata format for ML-ready datasets. Croissant is a new metadata format for ML datasets, aiming to simplify the use of existing datasets for training ML models. It standardizes dataset descriptions and organization, supporting responsible AI practices. Croissant builds upon schema.org and is supported by major tools and repositories like Kaggle, Hugging Face, and OpenML. It includes a specification, example datasets, a Python library, and a visual editor to facilitate dataset usage and publication.

🌀 KAIST AI’s INSTRUCTIR: A Benchmark for Instruction Following of Information Retrieval Models. This research focuses on enhancing search accuracy by improving retrievers to understand users' intentions, similar to language models. It introduces INSTRUCTIR, a benchmark for evaluating retrievers' ability to follow user-aligned instructions in retrieval tasks. The study addresses limitations in existing benchmarks and highlights potential overfitting issues in instruction-aware retrieval datasets.

🌀 Sussex AI’s Spyx: A Library for Just-In-Time Compiled Optimization of Spiking Neural Networks. Advancements in large neural architectures have led to powerful AI accelerators for training deep neural networks. However, these networks often incur high costs. Neuromorphic computing with Spiking Neural Networks (SNNs) offers energy-efficient alternatives, but training SNNs is challenging. Spyx, a new lightweight SNN simulation and optimization library designed in JAX, aims to facilitate SNN architecture investigation by bridging Python-based deep learning frameworks with custom compute kernels, achieving optimal hardware utilization.

🌀 VMware’s SynCode: Improving LLM Code Generation with Grammar Augmentation. SynCode is a novel framework for efficient syntactical decoding of code with large language models (LLMs). It leverages the grammar of a programming language using an offline-constructed, efficient lookup table called a Deterministic Finite Automaton (DFA) mask store. SynCode seamlessly integrates with any language defined by a context-free grammar (CFG), reducing syntax errors by 96.07% when combined with LLMs.

🌀 Chatbot Arena: An Open Platform for Evaluating LLMs by Human Preference. Chatbot Arena is an open platform designed to evaluate Large Language Models (LLMs) based on human preferences. Using a pairwise comparison method and crowdsourced input, it assesses LLMs' alignment with user preferences. The platform, operational for months with over 240K votes, provides a credible and valuable resource for ranking LLMs (a toy illustration of pairwise-vote ranking follows at the end of this section). Check out the tool here.

🌀 Apollo: A Lightweight Multilingual Medical LLM towards Democratizing Medical AI to 6B People. The project aims to develop medical Large Language Models (LLMs) in the six most spoken languages, benefiting 6.1 billion people. This includes creating the ApolloCorpora multilingual medical dataset and the XMedBench benchmark, with Apollo models achieving top performance among models of similar sizes. The project will open-source training data, code, model weights, and evaluation benchmarks. You can check out the demo here.
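To make the pairwise-vote idea behind Chatbot Arena concrete, here is a toy Elo-style ranking sketch in Python. This is only an illustration of turning head-to-head human preferences into scores, not Chatbot Arena's actual methodology (the project uses a more careful statistical treatment); the model names, initial rating, and K-factor are invented for the example:

# Toy Elo-style ranking from pairwise preference votes.
# Model names, the initial rating, and the K-factor are illustrative only.
from collections import defaultdict

def expected_score(r_a, r_b):
    # Probability that A beats B under the Elo model.
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))

def update_ratings(votes, k=32):
    # votes: list of (winner, loser) pairs from human comparisons.
    ratings = defaultdict(lambda: 1000.0)
    for winner, loser in votes:
        e_w = expected_score(ratings[winner], ratings[loser])
        # Winner gains k * (1 - expected win prob); loser loses the same amount.
        ratings[winner] += k * (1.0 - e_w)
        ratings[loser] -= k * (1.0 - e_w)
    return dict(ratings)

votes = [("model-a", "model-b"), ("model-b", "model-c"), ("model-a", "model-c")]
print(sorted(update_ratings(votes).items(), key=lambda kv: -kv[1]))

Because each update only needs the two ratings involved, this kind of scheme scales naturally to a stream of crowdsourced votes.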
✨ On the Radar: Catch Up on What's Fresh

🌀 Top Artificial Intelligence (AI) Tools That Can Generate Code To Help Programmers (2024): The article discusses how AI is changing programming, with tools like OpenAI Codex and GitHub Copilot generating code. It explores AI's impact on code quality and development speed, showcasing various AI-powered tools like Tabnine, CodeT5, and Polycoder. Additionally, it mentions AI tools for code review, static code analysis, and AI-assisted coding in IDEs like PyCharm and Visual Studio.

🌀 PyPI mirror in a private AWS environment with Terraform: This article explains how to install Python packages in an AWS SageMaker Studio environment without internet access. It covers setting up SageMaker in VPC Only mode, using VPC Endpoint interfaces for network communications, and accessing the PyPI package repository through AWS CodeArtifact, which allows defining PyPI as an upstream repository.

🌀 Custom pre-commit hooks for safer code changes: This blog post explains the importance of using pre-commit hooks in software development, particularly with the git version control system. It discusses the challenges of maintaining coding standards in collaborative projects and provides a step-by-step tutorial on how to set up and use custom pre-commit hooks for a Python project, using the example of validating dataflow definitions for the Hamilton library.

🌀 AQLM Quantization Algorithm, explained: A new quantization algorithm, AQLM (Additive Quantization of Language Models), was recently released and integrated into Hugging Face Transformers and Hugging Face PEFT. AQLM sets a new state of the art for 2-bit quantization while also providing improvements in the 3-bit and 4-bit ranges, pushing the boundaries of model accuracy and memory footprint.

🌀 Revolutionize Web Browsing with AI: This article explores creating an AI agent using the gpt-4-vision-preview model from OpenAI, enabling it to navigate the web like a human. It discusses the agent's browser control, content browsing, and decision-making processes, showcasing potential use cases such as aiding visually challenged users and automating web browsing tasks.

🌀 Understanding Tensors: Learning a Data Structure Through 3 Pesky Errors. This article discusses transitioning from managing tabular data to working with tensors in TensorFlow, offering debugging tips and code recipes. It covers visualizing TensorFlow datasets, understanding tensor specs, and augmenting model summaries, while addressing common errors related to tensor rank and shape (a small related sketch appears in the P.S. below).

🌀 Running RStudio Inside a Container: This tutorial focuses on setting up RStudio using Docker, particularly leveraging the Rocker RStudio image. It covers pulling the image, launching RStudio in a container, and ensuring data persistence through volume mapping. The tutorial provides step-by-step instructions and explanations for each stage.

🌀 PyTorch and MLX for Apple Silicon: The blog discusses Apple's MLX framework, which is optimized for Apple Silicon and serves as a bridge between PyTorch, NumPy, and JAX. It details a comparison between MLX and PyTorch through a custom convolutional neural network implementation for image classification tasks. The discussion includes insights into MLX's features, such as its array class, lazy computation, and compilation for performance optimization. The post also highlights the ease of converting PyTorch code to MLX, despite some differences in API compatibility and coding conventions.

See you next time!

Affiliate Disclosure: This newsletter contains affiliate links. If you buy through them, we may earn a small commission at no extra cost to you. This supports our work and helps us keep providing useful content. We only recommend products and services we think will benefit our readers. Thanks for your support!
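P.S. To go with the "Understanding Tensors" item above, here is a small, self-contained illustration of the kind of rank/shape debugging that article covers. The toy data and target shapes below are invented for this example and are not taken from the article:

# Inspecting tf.data element specs to debug rank/shape mismatches.
# The toy arrays and shapes here are invented for illustration.
import numpy as np
import tensorflow as tf

images = np.random.rand(8, 28, 28).astype("float32")   # rank-3: (batch, h, w)
labels = np.random.randint(0, 10, size=(8,))

ds = tf.data.Dataset.from_tensor_slices((images, labels)).batch(4)
print(ds.element_spec)  # shows the TensorSpec shape and dtype of each component

# A Conv2D stack expects rank-4 input (batch, h, w, channels), so a rank-3
# batch triggers a shape error; adding a channel axis fixes it.
ds_fixed = ds.map(lambda x, y: (tf.expand_dims(x, axis=-1), y))
print(ds_fixed.element_spec)

Printing element_spec before calling model.fit is usually the quickest way to spot this kind of rank mismatch.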

How to auto-generate texts from Shakespeare writing using deep recurrent neural networks

Savia Lobo
16 Feb 2018
6 min read
Our article is an excerpt from the book Natural Language Processing with Python Cookbook, co-authored by Krishna Bhavsar, Naresh Kumar, and Pratap Dangeti. The book gives unique recipes covering the various aspects of performing Natural Language Processing with NLTK, a leading Python platform for NLP.

Today we will learn to use deep recurrent neural networks (RNNs) to predict the next character from a fixed-length window of text. Trained this way, a model can generate text continuously and, given enough training epochs, imitate the writing style of the original author.

Getting ready...

The Project Gutenberg eBook of the complete works of William Shakespeare is used as the dataset to train the network for automated text generation. The raw file used for training can be downloaded from http://www.gutenberg.org/:

>>> from __future__ import print_function
>>> import numpy as np
>>> import random
>>> import sys

The following code creates a dictionary mapping characters to indices, plus the reverse mapping, which we will use to convert text into indices at later stages. This is because deep learning models cannot understand English; everything needs to be mapped into indices before training:

>>> path = 'C:\\Users\\prata\\Documents\\book_codes\\NLP_DL\\shakespeare_final.txt'
>>> text = open(path).read().lower()
>>> characters = sorted(list(set(text)))
>>> print('corpus length:', len(text))
>>> print('total chars:', len(characters))
>>> char2indices = dict((c, i) for i, c in enumerate(characters))
>>> indices2char = dict((i, c) for i, c in enumerate(characters))

How to do it…

Before training the model, various preprocessing steps are involved. The major steps are:

1. Preprocessing: Prepare the X and y data from the story text file and convert them into a vectorized index format.
2. Deep learning model training and validation: Train and validate the deep learning model.
3. Text generation: Generate text with the trained model.

How it works...

The following lines of code describe the entire modeling process of generating text from Shakespeare's writings. Here the window length is set to 40 characters for predicting the next single character, which is a reasonable amount of context. The extraction also jumps ahead by three characters at a time so that consecutive sequences overlap less and the dataset is more varied:

# cut the text in semi-redundant sequences of maxlen characters
>>> maxlen = 40
>>> step = 3
>>> sentences = []
>>> next_chars = []
>>> for i in range(0, len(text) - maxlen, step):
...     sentences.append(text[i: i + maxlen])
...     next_chars.append(text[i + maxlen])
>>> print('nb sequences:', len(sentences))

The total number of sequences extracted is 193,798, which is enough data for text generation.
The next code block converts the data into a vectorized format for feeding into the deep learning model, as the model cannot work with text, words, or sentences directly. Arrays of all zeros are created in NumPy and then filled in the relevant places using the dictionary mappings:

# Converting indices into vectorized format
>>> X = np.zeros((len(sentences), maxlen, len(characters)), dtype=np.bool)
>>> y = np.zeros((len(sentences), len(characters)), dtype=np.bool)
>>> for i, sentence in enumerate(sentences):
...     for t, char in enumerate(sentence):
...         X[i, t, char2indices[char]] = 1
...     y[i, char2indices[next_chars[i]]] = 1

>>> from keras.models import Sequential
>>> from keras.layers import Dense, LSTM, Activation, Dropout
>>> from keras.optimizers import RMSprop

The deep learning model is an RNN, more specifically a Long Short-Term Memory (LSTM) network with 128 hidden neurons; the output layer has one unit per character in the vocabulary. Finally, the softmax activation is used together with the RMSprop optimizer. We encourage readers to experiment with other parameters to see how the results vary:

# Model Building
>>> model = Sequential()
>>> model.add(LSTM(128, input_shape=(maxlen, len(characters))))
>>> model.add(Dense(len(characters)))
>>> model.add(Activation('softmax'))
>>> model.compile(loss='categorical_crossentropy', optimizer=RMSprop(lr=0.01))
>>> print(model.summary())

As mentioned earlier, deep learning models train on numeric indices to map input to output (given a window of 40 characters, the model predicts the next best character). The following function converts the predicted probabilities back into a character index by rescaling them with the diversity parameter, sampling from the resulting distribution, and taking the index of the sampled character:

# Function to convert prediction into index
>>> def pred_indices(preds, metric=1.0):
...     preds = np.asarray(preds).astype('float64')
...     preds = np.log(preds) / metric
...     exp_preds = np.exp(preds)
...     preds = exp_preds / np.sum(exp_preds)
...     probs = np.random.multinomial(1, preds, 1)
...     return np.argmax(probs)

The model will be trained over 30 iterations with a batch size of 128. The diversity (sampling temperature) is also varied to see its impact on the predictions:

# Train and Evaluate the Model
>>> for iteration in range(1, 30):
...     print('-' * 40)
...     print('Iteration', iteration)
...     model.fit(X, y, batch_size=128, epochs=1)
...     start_index = random.randint(0, len(text) - maxlen - 1)
...     for diversity in [0.2, 0.7, 1.2]:
...         print('\n----- diversity:', diversity)
...         generated = ''
...         sentence = text[start_index: start_index + maxlen]
...         generated += sentence
...         print('----- Generating with seed: "' + sentence + '"')
...         sys.stdout.write(generated)
...         for i in range(400):
...             x = np.zeros((1, maxlen, len(characters)))
...             for t, char in enumerate(sentence):
...                 x[0, t, char2indices[char]] = 1.
...             preds = model.predict(x, verbose=0)[0]
...             next_index = pred_indices(preds, diversity)
...             pred_char = indices2char[next_index]
...             generated += pred_char
...             sentence = sentence[1:] + pred_char
...             sys.stdout.write(pred_char)
...             sys.stdout.flush()
...         print("\nOne combination completed\n")

Comparing the output after the first iteration (Iteration 1) with the output after the final iteration (Iteration 29), it is apparent that with enough training the generated text becomes noticeably more coherent.

Though the text generation can seem almost magical, we have generated text from Shakespeare's writings, showing that with the right training and handling we can imitate the writing style of any particular writer. A brief note on saving the trained model for later reuse follows at the end of this post.

If you found this post useful, you may check out the book Natural Language Processing with Python Cookbook to analyze sentence structure and master lexical analysis, syntactic and semantic analysis, pragmatic analysis, and other NLP techniques.
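As a follow-up to the recipe above, here is a minimal sketch of persisting the trained model and character mappings so that text can be generated later without retraining. This is not part of the original recipe; the file names are arbitrary choices, and the snippet assumes the model, char2indices, and indices2char objects defined earlier:

# Save the trained model and the character mappings for later reuse
# (not part of the original recipe; file names are arbitrary).
import json

model.save('shakespeare_lstm.h5')          # Keras HDF5 format
with open('char_maps.json', 'w') as f:
    json.dump({'char2indices': char2indices}, f)

# Later, reload both and rebuild the reverse mapping before generating text.
# Keep maxlen (40) and the character vocabulary identical to training.
from keras.models import load_model

model = load_model('shakespeare_lstm.h5')
with open('char_maps.json') as f:
    char2indices = json.load(f)['char2indices']
indices2char = {i: c for c, i in char2indices.items()}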