Data | 0 articles | Tech News, Tutorials & Expert Insights

article-image-learning-the-salesforce-analytics-query-language-saql

09 Mar 2018

6 min read

Learning the Salesforce Analytics Query Language (SAQL)

09 Mar 2018

Salesforce Einstein offers its own query language to retrieve your data from various sources, called Salesforce Analytics Query Language (SAQL). The lenses and dashboards in Einstein use SAQL behind the scenes to manipulate data for meaningful visualizations. In this article, we see how to use Salesforce Analytics Query Language effectively. Using SAQL There are the following three ways to use SAQL in Einstein Analytics: Creating steps/lenses: We can use SAQL while creating a lens or step. It is the easiest way of using SAQL. While creating a step, Einstein Analytics provides the flexibility of switching between modes such as Chart Mode, Table Mode, and SAQL Mode. In this chapter, we will use this method for SAQL. Analytics REST API: Using this API, the user can access the datasets, lenses, dashboards, and so on. This is a programmatic approach and you can send the queries to the Einstein Analytics platform. Einstein Analytics uses the OAuth 2.0 protocol to securely access the platform data. The OAuth protocol is a way of securely authenticating the user without asking them for credentials. The first step to using the Analytics REST API to access Analytics is to authenticate the user using OAuth 2.0. Using Dashboard JSON: We can use SAQL while editing the Dashboard JSON. We have already seen the Dashboard JSON in previous chapters. To access Dashboard JSON, you can open the dashboard in the edit mode and press Ctrl + E. The simplest way of using SAQL is while creating a step or lens. A user can switch between the modes here. To use SAQL for lens, perform the following steps: Navigate to Analytics Studio | DATASETS and select any dataset. We are going to select Opportunity here. Click on it and it will open a window to create a lens. Switch to SAQL Mode by clicking on the icon in the top-right corner, as shown in the following screenshot: In SAQL, the query is made up of multiple statements. In the first statement, the query loads the input data from the dataset, operates on it, and then finally gives the result. The user can use the Run Query button to see the results and errors after changing or adding statements. The user can see the errors at the bottom of the Query editor. SAQL is made up of statements that take the input dataset, and we build our logic on that. We can add filters, groups, orders, and so on, to this dataset to get the desired output. There are certain order rules that need to be followed while creating these statements and those rules are as follows: There can be only one offset in the foreach statement The limit statement must be after offset The offset statement must be after filter and order The order and filter statements can be swapped as there is no rule for them In SAQL, we can perform all the mathematical calculations and comparisons. SAQL also supports arithmetic operators, comparison operators, string operators, and logical operators. Using foreach in SAQL The foreach statement applies the set of expressions to every row, which is called projection. The foreach statement is mandatory to get the output of the query. The following is the syntax for the foreach statement: q = foreach q generate expression as 'expresion name'; Let's look at one example of using the foreach statement: Go to Analytics Studio | DATASETS and select any dataset. We are going to select Opportunity here. Click on it and it will open a window to create a lens. Switch to SAQL Mode by clicking on the icon in the top-right corner. In the Query editor you will see the following code: q = load "opportunity"; q = group q by all; q = foreach q generate count() as 'count'; q = limit q 2000; You can see the result of this query just below the Query editor: 4. Now replace the third statement with the following statement: q = foreach q generate sum('Amount') as 'Sum Amount'; 5. Click on the Run Query button and observe the result as shown in the following screenshot: Using grouping in SAQL The user can group records of the same value in one group by using the group statements. Use the following syntax: q = group rows by fieldName Let's see how to use grouping in SAQL by performing the following steps: Replace the second and third statement with the following statement: q = group q by 'StageName'; q = foreach q generate 'StageName' as 'StageName', sum('Amount') as 'Sum Amount'; 2. Click on the Run Query button and you should see the following result: Using filters in SAQL Filters in SAQL behave just like a where clause in SOQL and SQL, filtering the data as per the condition or clause. In Einstein Analytics, it selects the row from the dataset that satisfies the condition added. The syntax for the filter is as follows: q = filter q by fieldName 'Operator' value Click on Run Query and view the result as shown in the following screenshot: Using functions in SAQL The beauty of a function is in its reusability. Once the function is created it can be used multiple times. In SAQL, we can use different types of functions, such as string functions, math functions, aggregate functions, windowing functions, and so on. These functions are predefined and saved quite a few times. Let's use a math function power. The syntax for the power is power(m, n). The function returns the value of m raised to the nth power. Replace the following statement with the fourth statement: q = foreach q generate 'StageName' as 'StageName', power(sum('Amount'), 1/2) as 'Amount Squareroot', sum('Amount') as 'Sum Amount'; Click on the Run Query button. We saw how to apply different kinds of case-specific functions in Salesforce Einstein to play with data in order to get the desired outcome. [box type="note" align="" class="" width=""]The above excerpt is taken from the book Learning Einstein Analytics, written by Santosh Chitalkar. It covers techniques to set-up and create apps, lenses, and dashboards using Salesforce Einstein Analytics for effective business insights. If you want to know more about these techniques, check out the book Learning Einstein Analytics.[/box]

0
1
30864

article-image-perform-data-partitioning-postgresql-10

Sugandha Lahoti

09 Mar 2018

11 min read

How to perform data partitioning in PostgreSQL 10

Sugandha Lahoti

09 Mar 2018

11 min read

0
0
21791

article-image-how-to-perform-audio-video-image-scraping-with-python

Amarabha Banerjee

08 Mar 2018

9 min read

How to perform Audio-Video-Image Scraping with Python

Amarabha Banerjee

08 Mar 2018

9 min read

[box type="note" align="" class="" width=""]Our article is an excerpt from the book Web Scraping with Python, written by Richard Lawson. This book contains step by step tutorials on how to leverage Python programming techniques for ethical web scraping. [/box] A common practice in scraping is the download, storage, and further processing of media content (non-web pages or data files). This media can include images, audio, and video. To store the content locally (or in a service like S3) and to do it correctly, we need to know what is the type of media, and it isn’t enough to trust the file extension in the URL. Hence, we will learn how to download and correctly represent the media type based on information from the web server. Another common task is the generation of thumbnails of images, videos, or even a page of a website. We will examine several techniques of how to generate thumbnails and make website page screenshots. Many times these are used on a new website as thumbnail links to the scraped media which is stored locally. Finally, it is often the need to be able to transcode media, such as converting non-MP4 videos to MP4, or changing the bit-rate or resolution of a video. Another scenario is to extract only the audio from a video file. We won't look at video transcoding, but we will rip MP3 audio out of an MP4 file using ffmpeg. It's a simple step from there to also transcode video with ffmpeg. Downloading media content from the web Downloading media content from the web is a simple process: use Requests or another library and download it just like you would HTML content. Getting ready There is a class named URLUtility in the urls.py module in the util folder of the solution. This class handles several of the scenarios in this chapter with downloading and parsing URLs. We will be using this class in this recipe and a few others. Make sure the modules folder is in your Python path. Also, the example for this recipe is in the 04/01_download_image.py file. How to do it Here is how we proceed with the recipe: The URLUtility class can download content from a URL. The code in the recipe's file is the following: import const from util.urls import URLUtility util = URLUtility(const.ApodEclipseImage()) print(len(util.data)) When running this you will see the following output: Reading URL: https://apod.nasa.gov/apod/image/1709/BT5643s.jpg Read 171014 bytes 171014 The example reads 171014 bytes of data. How it works The URL is defined as a constant const.ApodEclipseImage() in the const module: def ApodEclipseImage(): return "https://apod.nasa.gov/apod/image/1709/BT5643s.jpg" The constructor of the URLUtility class has the following implementation: def __init__(self, url, readNow=True): """ Construct the object, parse the URL, and download now if specified""" self._url = url self._response = None self._parsed = urlparse(url) if readNow: self.read() The constructor stores the URL, parses it, and downloads the file with the read() method. The following is the code of the read() method: def read(self): self._response = urllib.request.urlopen(self._url) self._data = self._response.read() This function uses urlopen to get a response object, and then reads the stream and stores it as a property of the object. That data can then be retrieved using the data property: @property def data(self): self.ensure_response() return self._data The code then simply reports on the length of that data, with the value of 171014. There's more This class will be used for other tasks such as determining content types, filename, and extensions for those files. We will examine parsing of URLs for filenames next. Parsing a URL with urllib to get the filename When downloading content from a URL, we often want to save it in a file. Often it is good enough to save the file in a file with a name found in the URL. But the URL consists of a number of fragments, so how can we find the actual filename from the URL, especially where there are often many parameters after the file name? Getting ready We will again be using the URLUtility class for this task. The code file for the recipe is 04/02_parse_url.py. How to do it Execute the recipe's file with your python interpreter. It will run the following code: util = URLUtility(const.ApodEclipseImage()) print(util.filename_without_ext) This results in the following output: Reading URL: https://apod.nasa.gov/apod/image/1709/BT5643s.jpg Read 171014 bytes The filename is: BT5643s How it works In the constructor for URLUtility, there is a call to urlib.parse.urlparse. The following demonstrates using the function interactively: >>> parsed = urlparse(const.ApodEclipseImage()) >>> parsed ParseResult(scheme='https', netloc='apod.nasa.gov', path='/apod/image/1709/BT5643s.jpg', params='', query='', fragment='') The ParseResult object contains the various components of the URL. The path element contains the path and the filename. The call to the .filename_without_ext property returns just the filename without the extension: @property def filename_without_ext(self): filename = os.path.splitext(os.path.basename(self._parsed.path))[0] return filename The call to os.path.basename returns only the filename portion of the path (including the extension). os.path.splittext() then separates the filename and the extension, and the function returns the first element of that tuple/list (the filename). There's more It may seem odd that this does not also return the extension as part of the filename. This is because we cannot assume that the content that we received actually matches the implied type from the extension. It is more accurate to determine this using headers returned by the web server. That's our next recipe. Determining the type of content for a URL When performing a GET requests for content from a web server, the web server will return a number of headers, one of which identities the type of the content from the perspective of the web server. In this recipe we learn to use that to determine what the web server considers the type of the content. Getting ready We again use the URLUtility class. The code for the recipe is in 04/03_determine_content_type_from_response.py. How to do it We proceed as follows: Execute the script for the recipe. It contains the following code: util = URLUtility(const.ApodEclipseImage()) print("The content type is: " + util.contenttype) With the following result: Reading URL: https://apod.nasa.gov/apod/image/1709/BT5643s.jpg Read 171014 bytes The content type is: image/jpeg How it works The .contentype property is implemented as follows: @property def contenttype(self): self.ensure_response() return self._response.headers['content-type'] The .headers property of the _response object is a dictionary-like class of headers. The content-type key will retrieve the content-type specified by the server. This call to the ensure_response() method simply ensures that the .read() function has been executed. There's more The headers in a response contain a wealth of information. If we look more closely at the headers property of the response, we can see the following headers are returned: >>> response = urllib.request.urlopen(const.ApodEclipseImage()) >>> for header in response.headers: print(header) Date Server Last-Modified ETag Accept-Ranges Content-Length Connection Content-Type Strict-Transport-Security And we can see the values for each of these headers. >>> for header in response.headers: print(header + " ==> " + response.headers[header]) Date ==> Tue, 26 Sep 2017 19:31:41 GMT Server ==> WebServer/1.0 Last-Modified ==> Thu, 31 Aug 2017 20:26:32 GMT ETag ==> "547bb44-29c06-5581275ce2b86" Accept-Ranges ==> bytes Content-Length ==> 171014 Connection ==> close Content-Type ==> image/jpeg Strict-Transport-Security ==> max-age=31536000; includeSubDomains Many of these we will not examine in this book, but for the unfamiliar it is good to know that they exist. Determining the file extension from a content type It is good practice to use the content-type header to determine the type of content, and to determine the extension to use for storing the content as a file. Getting ready We again use the URLUtility object that we created. The recipe's script is 04/04_determine_file_extension_from_contenttype.py):. How to do it Proceed by running the recipe's script. An extension for the media type can be found using the .extension property: util = URLUtility(const.ApodEclipseImage()) print("Filename from content-type: " + util.extension_from_contenttype) print("Filename from url: " + util.extension_from_url) This results in the following output: Reading URL: https://apod.nasa.gov/apod/image/1709/BT5643s.jpg Read 171014 bytes Filename from content-type: .jpg Filename from url: .jpg This reports both the extension determined from the file type, and also from the URL. These can be different, but in this case they are the same. How it works The following is the implementation of the .extension_from_contenttype property: @property def extension_from_contenttype(self): self.ensure_response() map = const.ContentTypeToExtensions() if self.contenttype in map: return map[self.contenttype] return None The first line ensures that we have read the response from the URL. The function then uses a python dictionary, defined in the const module, which contains a dictionary of content types to extension: def ContentTypeToExtensions(): return { "image/jpeg": ".jpg", "image/jpg": ".jpg", "image/png": ".png" } If the content type is in the dictionary, then the corresponding value will be returned. Otherwise, None is returned. Note the corresponding property, .extension_from_url: @property def extension_from_url(self): ext = os.path.splitext(os.path.basename(self._parsed.path))[1] return ext This uses the same technique as the .filename property to parse the URL, but instead returns the [1] element, which represents the extension instead of the base filename. To summarize, we discussed how effectively we can scrap audio, video and image content from the web using Python. If you liked our post, be sure to check out Web Scraping with Python, which gives more information on performing web scraping efficiently with Python.

0
0
33710

article-image-4-common-challenges-web-scraping-handle

Amarabha Banerjee

08 Mar 2018

13 min read

4 common challenges in Web Scraping and how to handle them

Amarabha Banerjee

08 Mar 2018

13 min read

[box type="note" align="" class="" width=""]Our article is an excerpt from the book Web Scraping with Python, written by Richard Lawson. This book contains step by step tutorials on how to leverage Python programming techniques for ethical web scraping. [/box] In this article, we will explore primary challenges of Web Scraping and how to get away with it easily. Developing a reliable scraper is never easy, there are so many what ifs that we need to take into account. What if the website goes down? What if the response returns unexpected data? What if your IP is throttled or blocked? What if authentication is required? While we can never predict and cover all what ifs, we will discuss some common traps, challenges, and workarounds. Note that several of the recipes require access to a website that I have provided as a Docker container. They require more logic than the simple, static site we used in earlier chapters. Therefore, you will need to pull and run a Docker container using the following Docker commands: docker pull mheydt/pywebscrapecookbook docker run -p 5001:5001 pywebscrapecookbook Retrying failed page downloads Failed page requests can be easily handled by Scrapy using retry middleware. When installed, Scrapy will attempt retries when receiving the following HTTP error codes: [500, 502, 503, 504, 408] The process can be further configured using the following parameters: RETRY_ENABLED (True/False - default is True) RETRY_TIMES (# of times to retry on any errors - default is 2) RETRY_HTTP_CODES (a list of HTTP error codes which should be retried - default is [500, 502, 503, 504, 408]) How to do it The 06/01_scrapy_retry.py script demonstrates how to configure Scrapy for retries. The script file contains the following configuration for Scrapy: process = CrawlerProcess({ 'LOG_LEVEL': 'DEBUG', 'DOWNLOADER_MIDDLEWARES': { "scrapy.downloadermiddlewares.retry.RetryMiddleware": 500 }, 'RETRY_ENABLED': True, 'RETRY_TIMES': 3 }) process.crawl(Spider) process.start() How it works Scrapy will pick up the configuration for retries as specified when the spider is run. When encountering errors, Scrapy will retry up to three times before giving up. Supporting page redirects Page redirects in Scrapy are handled using redirect middleware, which is enabled by default. The process can be further configured using the following parameters: REDIRECT_ENABLED: (True/False - default is True) REDIRECT_MAX_TIMES: (The maximum number of redirections to follow for any single request - default is 20) How to do it The script in 06/02_scrapy_redirects.py demonstrates how to configure Scrapy to handle redirects. This configures a maximum of two redirects for any page. Running the script reads the NASA sitemap and crawls that content. This contains a large number of redirects, many of which are redirects from HTTP to HTTPS versions of URLs. There will be a lot of output, but here are a few lines demonstrating the output: Parsing: <200 https://www.nasa.gov/content/earth-expeditions-above/> ['http://www.nasa.gov/content/earth-expeditions-above', 'https://www.nasa.gov/content/earth-expeditions-above'] This particular URL was processed after one redirection, from an HTTP to an HTTPS version of the URL. The list defines all of the URLs that were involved in the redirection. You will also be able to see where redirection exceeded the specified level (2) in the output pages. The following is one example: 2017-10-22 17:55:00 [scrapy.downloadermiddlewares.redirect] DEBUG: Discarding <GET http://www.nasa.gov/topics/journeytomars/news/index.html>: max redirections reached How it works The spider is defined as the following: class Spider(scrapy.spiders.SitemapSpider): name = 'spider' sitemap_urls = ['https://www.nasa.gov/sitemap.xml'] def parse(self, response): print("Parsing: ", response) print (response.request.meta.get('redirect_urls')) This is identical to our previous NASA sitemap based crawler, with the addition of one line printing the redirect_urls. In any call to parse, this metadata will contain all redirects that occurred to get to this page. The crawling process is configured with the following code: process = CrawlerProcess({ 'LOG_LEVEL': 'DEBUG', 'DOWNLOADER_MIDDLEWARES': { "scrapy.downloadermiddlewares.redirect.RedirectMiddleware": 500 }, 'REDIRECT_ENABLED': True, 'REDIRECT_MAX_TIMES': 2 }) Redirect is enabled by default, but this sets the maximum number of redirects to 2 instead of the default of 20. Waiting for content to be available in Selenium A common problem with dynamic web pages is that even after the whole page has loaded, and hence the get() method in Selenium has returned, there still may be content that we need to access later as there are outstanding Ajax requests from the page that are still pending completion. An example of this is needing to click a button, but the button not being enabled until all data has been loaded asynchronously to the page after loading. Take the following page as an example: http://the-internet.herokuapp.com/dynamic_loading/2. This page finishes loading very quickly and presents us with a Start button: When pressing the button, we are presented with a progress bar for five seconds: And when this is completed, we are presented with Hello World! Now suppose we want to scrape this page to get the content that is exposed only after the button is pressed and after the wait? How do we do this? How to do it We can do this using Selenium. We will use two features of Selenium. The first is the ability to click on page elements. The second is the ability to wait until an element with a specific ID is available on the page. First, we get the button and click it. The button's HTML is the following: <div id='start'> <button>Start</button> </div> When the button is pressed and the load completes, the following HTML is added to the document: <div id='finish'> <h4>Hello World!"</h4> </div> We will use the Selenium driver to find the Start button, click it, and then wait until a div with an ID of 'finish' is available. Then we get that element and return the text in the enclosed <h4> tag. You can try this by running 06/03_press_and_wait.py. It's output will be the following: clicked Hello World! Now let's see how it worked. How it works Let us break down the explanation: We start by importing the required items from Selenium: from selenium import webdriver from selenium.webdriver.support import ui Now we load the driver and the page: driver = webdriver.PhantomJS() driver.get("http://the-internet.herokuapp.com/dynamic_loading/2") With the page loaded, we can retrieve the button: button = driver.find_element_by_xpath("//*/div[@id='start']/button") And then we can click the button: button.click() print("clicked") Next we create a WebDriverWait object: wait = ui.WebDriverWait(driver, 10) With this object, we can request Selenium's UI wait for certain events. This also sets a maximum wait of 10 seconds. Now using this, we can wait until we meet a criterion; that an element is identifiable using the following XPath: wait.until(lambda driver: driver.find_element_by_xpath("//*/div[@id='finish']")) When this completes, we can retrieve the h4 element and get its enclosing text: finish_element=driver.find_element_by_xpath("//*/div[@id='finish']/ h4") print(finish_element.text) Limiting crawling to a single domain We can inform Scrapy to limit the crawl to only pages within a specified set of domains. This is an important task, as links can point to anywhere on the web, and we often want to control where crawls end up going. Scrapy makes this very easy to do. All that needs to be done is setting the allowed_domains field of your scraper class. How to do it The code for this example is 06/04_allowed_domains.py. You can run the script with your Python interpreter. It will execute and generate a ton of output, but if you keep an eye on it, you will see that it only processes pages on nasa.gov. How it works The code is the same as previous NASA site crawlers except that we include allowed_domains=['nasa.gov']: class Spider(scrapy.spiders.SitemapSpider): name = 'spider' sitemap_urls = ['https://www.nasa.gov/sitemap.xml'] allowed_domains=['nasa.gov'] def parse(self, response): print("Parsing: ", response) The NASA site is fairly consistent with staying within its root domain, but there are occasional links to other sites such as content on boeing.com. This code will prevent moving to those external sites. Processing infinitely scrolling pages Many websites have replaced "previous/next" pagination buttons with an infinite scrolling mechanism. These websites use this technique to load more data when the user has reached the bottom of the page. Because of this, strategies for crawling by following the "next page" link fall apart. While this would seem to be a case for using browser automation to simulate the scrolling, it's actually quite easy to figure out the web pages' Ajax requests and use those for crawling instead of the actual page. Let's look at spidyquotes.herokuapp.com/scroll as an example. Getting ready Open http://spidyquotes.herokuapp.com/scroll in your browser. This page will load additional content when you scroll to the bottom of the page: Screenshot of the quotes to scrape Once the page is open, go into your developer tools and select the network panel. Then, scroll to the bottom of the page. You will see new content in the network panel: When we click on one of the links, we can see the following JSON: { "has_next": true, "page": 2, "quotes": [{ "author": { "goodreads_link": "/author/show/82952.Marilyn_Monroe", "name": "Marilyn Monroe", "slug": "Marilyn-Monroe" }, "tags": ["friends", "heartbreak", "inspirational", "life", "love", "sisters"], "text": "u201cThis life is what you make it...." }, { "author": { "goodreads_link": "/author/show/1077326.J_K_Rowling", "name": "J.K. Rowling", "slug": "J-K-Rowling" }, "tags": ["courage", "friends"], "text": "u201cIt takes a great deal of bravery to stand up to our enemies, but just as much to stand up to our friends.u201d" }, This is great because all we need to do is continually generate requests to /api/quotes?page=x, increasing x until the has_next tag exists in the reply document. If there are no more pages, then this tag will not be in the document. How to do it The 06/05_scrapy_continuous.py file contains a Scrapy agent, which crawls this set of pages. Run it with your Python interpreter and you will see output similar to the following (the following is multiple excerpts from the output): <200 http://spidyquotes.herokuapp.com/api/quotes?page=2> 2017-10-29 16:17:37 [scrapy.core.scraper] DEBUG: Scraped from <200 http://spidyquotes.herokuapp.com/api/quotes?page=2> {'text': "“This life is what you make it. No matter what, you're going to mess up sometimes, it's a universal truth. But the good part is you get to decide how you're going to mess it up. Girls will be your friends - they'll act like it anyway. But just remember, some come, some go. The ones that stay with you through everything - they're your true best friends. Don't let go of them. Also remember, sisters make the best friends in the world. As for lovers, well, they'll come and go too. And baby, I hate to say it, most of them - actually pretty much all of them are going to break your heart, but you can't give up because if you give up, you'll never find your soulmate. You'll never find that half who makes you whole and that goes for everything. Just because you fail once, doesn't mean you're gonna fail at everything. Keep trying, hold on, and always, always, always believe in yourself, because if you don't, then who will, sweetie? So keep your head high, keep your chin up, and most importantly, keep smiling, because life's a beautiful thing and there's so much to smile about.”", 'author': 'Marilyn Monroe', 'tags': ['friends', 'heartbreak', 'inspirational', 'life', 'love', 'Sisters']} 2017-10-29 16:17:37 [scrapy.core.scraper] DEBUG: Scraped from <200 http://spidyquotes.herokuapp.com/api/quotes?page=2> {'text': '“It takes a great deal of bravery to stand up to our enemies, but just as much to stand up to our friends.”', 'author': 'J.K. Rowling', 'tags': ['courage', 'friends']} 2017-10-29 16:17:37 [scrapy.core.scraper] DEBUG: Scraped from <200 http://spidyquotes.herokuapp.com/api/quotes?page=2> {'text': "“If you can't explain it to a six year old, you don't understand it yourself.”", 'author': 'Albert Einstein', 'tags': ['simplicity', 'Understand']} When this gets to page 10 it will stop as it will see that there is no next page flag set in the Content. How it works Let's walk through the spider to see how this works. The spider starts with the following definition of the start URL: class Spider(scrapy.Spider): name = 'spidyquotes' quotes_base_url = 'http://spidyquotes.herokuapp.com/api/quotes' start_urls = [quotes_base_url] download_delay = 1.5 The parse method then prints the response and also parses the JSON into the data variable: def parse(self, response): print(response) data = json.loads(response.body) Then it loops through all the items in the quotes element of the JSON objects. For each item, it yields a new Scrapy item back to the Scrapy engine: for item in data.get('quotes', []): yield { 'text': item.get('text'), 'author': item.get('author', {}).get('name'), 'tags': item.get('tags'), } It then checks to see if the data JSON variable has a 'has_next' property, and if so it gets the next page and yields a new request back to Scrapy to parse the next page: if data['has_next']: next_page = data['page'] + 1 yield scrapy.Request(self.quotes_base_url + "?page=%s" % next_page) There's more... It is also possible to process infinite, scrolling pages using Selenium. The following code is in 06/06_scrape_continuous_twitter.py: from selenium import webdriver import time driver = webdriver.PhantomJS() print("Starting") driver.get("https://twitter.com") scroll_pause_time = 1.5 # Get scroll height last_height = driver.execute_script("return document.body.scrollHeight") while True: print(last_height) # Scroll down to bottom driver.execute_script("window.scrollTo(0, document.body.scrollHeight);") # Wait to load page time.sleep(scroll_pause_time) # Calculate new scroll height and compare with last scroll height new_height = driver.execute_script("return document.body.scrollHeight") print(new_height, last_height) if new_height == last_height: break last_height = new_height The output would be similar to the following: Starting 4882 8139 4882 8139 11630 8139 11630 15055 11630 15055 15055 15055 Process finished with exit code 0 This code starts by loading the page from Twitter. The call to .get() will return when the page is fully loaded. The scrollHeight is then retrieved, and the program scrolls to that height and waits for a moment for the new content to load. The scrollHeight of the browser is retrieved again, and if different than last_height, it will loop and continue processing. If the same as last_height, no new content has loaded and you can then continue on and retrieve the HTML for the completed page. We have discussed the common challenges faced in performing Web Scraping using Python and got to know their workaround. If you liked this post, be sure to check out Web Scraping with Python, which consists of useful recipes to work with Python and perform efficient web scraping.

0
0
35552

article-image-spam-filtering-natural-language-processing-approach

Packt

08 Mar 2018

16 min read

Spam Filtering - Natural Language Processing Approach

Packt

08 Mar 2018

16 min read

0
0
7810

article-image-data-explorationusing-spark-sql

Packt

08 Mar 2018

9 min read

Data Exploration using Spark SQL

Packt

08 Mar 2018

9 min read

In this article, Aurobindo Sarkar, the author of the book, Learning Spark SQL, we will be covering the following points to introduce you to using Spark SQL for exploratory data analysis. What is exploratory Data Analysis (EDA)? Why EDA is important? Using Spark SQL for basic data analysis Visualizing data with Apache Zeppelin Introducing Exploratory Data Analysis (EDA) Exploratory Data Analysis (EDA), or Initial Data Analysis (IDA), is an approach to data analysis that attempts to maximize insight into data. This includes assessing the quality and structure of the data, calculating summary or descriptive statistics, and plotting appropriate graphs. It can uncover underlying structures and suggest how the data should be modeled. Furthermore, EDA helps us detect outliers, errors, and anomalies in our data, and deciding what to do about such data is often more important than other more sophisticated analysis. EDA enables us to test our underlying assumptions, discover clusters and other patterns in our data, and identify possible relationships between various variables. A careful EDA process is vital to understanding the data and is sometimes sufficient to reveal such poor data quality that a more sophisticated model-based analysis is not justified. Typically, the graphical techniques used in EDA are simple, consisting of plotting the raw data and simple statistics. The focus is on the structures and models revealed by the data or best fit the data. EDA techniques include scatter plots, box plots, histograms, probability plots, and so on. In most EDA techniques, we use all of the data, without making any underlying assumptions. The analyst builds intuition, or gets a “feel”, for the data set as a result of such exploration. More specifically, the graphical techniques allow us to, efficiently, select and validate appropriate models, test our assumptions, identify relationships, select estimators, and detect outliers. EDA involves a lot of trial and error, and several iterations.The best way is to start simple and then build in complexity as you go along. There is a major trade-off in modeling between the simple and the more accurate ones. Simple models may be much easier to interpret and understand. These models can get you to 90% accuracy very quickly, versus a more complex model that might take weeks or months to get you an additional 2% improvement.For example, you should plot simple histograms and scatterplots to quickly start developing an intuition for your data. Using Spark SQL for basic data analysis Interactively, processing and visualizing large data is challenging as the queries can take long to execute and the visual interface cannot accommodate as many pixels as data points. Spark supports in-memory computations and a high degree of parallelism to achieve interactivity with large distributed data. In addition, Spark is capable of handling petabytes of data and provides a set of versatile programming interfaces and libraries. These include SQL, Scala, Python, Java, and R APIs, and libraries for distributed statistics and machine learning. For data that fits into a single computer there are many good tools available such as R, MATLAB, and others. However, if the data does not fit into a single machine, or if it is very complicated to get the data to that machine, or if a single computer cannot easily process the data then this section will offer some good tools and techniques data exploration. Here, we will do some basic data exploration exercises to understand a sample dataset. We will use a dataset that contains data related to direct marketing campaigns (phone calls) of a Portuguese banking institution. The marketing campaigns were based on phone calls to customers. We use the bank-additional-full.csv file that contains 41188 records and 20 input fields, ordered by date (from May 2008 to November 2010). As a first step, let’s define a schema and read in the CSV file to create a DataFrame. You can use :paste to paste-in the initial set of statements in the Spark shell, as shown in the following figure: After the DataFrame is created, we first verify the number of records. We can also define a case class called Call for our input records, and then create a strongly-typed Dataset as follows: In the next section, we will begin our data exploration by identifying missing data in our dataset. Identifying Missing Data Missing data can occur in datasets due to reasons ranging from negligence to a refusal on part of respondants to provide a specific data point. However, in all cases missing data is a common occurrence in real-world datasets. Missing data can create problems in data analysis and sometimes lead to wrong decisions or conclusions. Hence, it is very important to identify missing data and devise effective strategies for dealing with it. Here, weanalyzethe numbers of records with missing data fields in our sample dataset. In order to simulate missing data, we will edit our sample dataset by replacing fields containing “unknown” values with empty strings. First, we created a DataFrame / Dataset from our edited file, as shown in the following figure: The following two statements give us a count of rows with certain fields having missing data. Later, we will look at effective ways of dealing with missing data and compute some basic statistics for sample dataset to improve our understanding of the data. Computing basic statistics Computing basic statistics is essential for a good preliminary understanding of our data. First, for convenience, we create a case class and a dataset containing a subset of fields from our original DataFrame. In our following example, we choose some of the numeric fields and the outcome field that is the “term deposit subscribed” field. Next, we use describe to quickly compute the count, mean, standard deviation, min and max values for the numeric columns in our dataset. Further, we use the stat package to compute additional statistics like covariance, correlation, creating crosstabs, examining items that occur most frequently in data columns, and computing quantiles. These computations are shown in the following figure: Next, we use the typed aggregation functions to summarize our data to understand our data better. In the following statement, we aggregate the results by whether a term deposit was subscribed along with total customers contacted, average number of calls made per customer, the average duration of the calls and the average number of previous calls made to such customers. The results are rounded to two decimal points. Similarly, executing the following statement givessimilar results by customers’ age. After getting a better understanding of our data by computing basic statistics, we shift our focus to identifying outliers in our data. Identifying data outliers An outlier or an anomalyis an observation of the data that deviates significantly from other observations in the dataset. Erroneous outliers are observations thatare distorted due to possible errors in the data-collection process. These outliers may exert undue influence on the results of statistical analysis, sothey should be identified using reliable detection methods prior to performing data analysis. Many algorithms find outliers as a side-product of clustering algorithms. However these techniques define outliers as points, which do not lie in clusters. The user has to model the data points using a statistical distributions, and the outliers identified depending on how they appear in relation to the underlying model. The main problem with these approaches is that during EDA, the user typically doesnot have enough knowledge about the underlying data distribution. EDA, using a modeling and visualizing approach, is a good way of achieving a deeper intuition of our data. SparkMLlib supports a large (and growing) set of distributed machine learning algorithms to make this task simpler. In the following example, we use the k-means clustering algorithm to compute two clusters in our data. Other distributed algorithms useful for EDA includeclassification, regression, dimensionality reduction, correlation and hypothesis testing. Visualizing data with Apache Zeppelin Typically, we will generate many graphs to verify our hunches about the data.A lot of thesequick and dirty graphs used during EDA are,ultimately, discarded. Exploratory data visualization is critical for data analysis and modeling. However, we often skip exploratory visualization with large data because it is hard. For instance, browsers, typically, cannot handle millions of data points.Hence we have to summarize, sample or model our data before we can effectively visualize it. Traditionally, BI tools provided extensive aggregation and pivoting features to visualize the data. However, these tools typically used nightly jobs to summarize large volumes of data. The summarized data wassubsequently downloaded and visualizedon the practitioner’s workstations. Spark can eliminate many of these batch jobs to support interactive data visualization. Here, we will explore some basic data visualization techniques using Apache Zeppelin. Apache Zeppelin is a web-based tool that supports interactive data analysis and visualization. It supports several language interpreters and comes with built-in Spark integration. Hence, it is quick and easy to get started with exploratory data analysis using Apache Zeppelin. You can download Appache Zeppelin from https://zeppelin.apache.org/. Unzip the package on your hard drive and start Zeppelin using the following command: Aurobindos-MacBook-Pro-2:zeppelin-0.6.2-bin-allaurobindosarkar$ bin/zeppelin-daemon.sh start You should see the following message: Zeppelin start [ OK] You should be able to see the Zeppelin home page at: http://localhost:8080/ Click on Create new note link, and specify a path and name for your notebook, as shown in the following figure: In the next step, we paste the same code as in the beginning of this article to create a DataFrame for our sample dataset. We can execute typical DataFrameoperations as shown in the following figure: Next, we create a table from our DataFrame and execute some SQL on it. The results of the SQL statements execution can be charted by clicking on the appropriatechart-type required. Here, we create bar charts as an illustrative example of summarizing and visualizing data: We can also plot a scatter plot, and read the coordinate values of each of the points plotted, as shown in the following two figures. Additionally, we can create a textbox that accepts input values to make experience interactive. In the following figure we create a textbox that can accept different values for the age parameter and the bar chart is updated, accordingly. Similarly, we can also create dropdown lists where the user can select the appropriate option, and the table of values or chart, automatically gets updated. Summary In this article, we demonstrated using Spark SQL for exploring datasets, performing basic data quality checks, generating samples and pivot tables, and visualizing data with Apache Zeppelin.

0
0
27233

article-image-implementing-matrix-operations-using-scipy-numpy

Pravin Dhandre

07 Mar 2018

5 min read

Implementing matrix operations using SciPy and NumPy

Pravin Dhandre

07 Mar 2018

5 min read

0
0
71101

article-image-how-to-set-up-a-deep-learning-system-on-amazon-web-services-aws

Gebin George

07 Mar 2018

5 min read

How to set up a Deep Learning System on Amazon Web Services (AWS)

Gebin George

07 Mar 2018

5 min read

[box type="note" align="" class="" width=""]This article is an excerpt from the book, Deep Learning Essentials written by Wei Di, Anurag Bhardwaj, and Jianing Wei. This book covers popular Python libraries such as Tensorflow, Keras, and more, along with tips to train, deploy and optimize deep learning models in the best possible manner.[/box] Today, we will learn two different methods of setting up a deep learning system using Amazon Web Services (AWS). Setup from scratch We will illustrate how to set up a deep learning environment on an AWS EC2 GPU instance g2.2xlarge running Ubuntu Server 16.04 LTS. For this example, we will use a pre-baked Amazon Machine Image (AMI) which already has a number of software packages installed—making it easier to set up an end-end deep learning system. We will use a publicly available AMI Image ami-b03ffedf, which has following pre-installed Packages: CUDA 8.0 Anaconda 4.20 with Python 3.0 Keras / Theano The first step to setting up the system is to set up an AWS account and spin a new EC2 GPU instance using the AWS web console as (http://console.aws.amazon.com/) shown in figure Choose EC2 AMI: 2. We pick a g2.2xlarge instance type from the next page as shown in figure Choose instance type: 3. After adding a 30 GB of storage as shown in figure Choose storage, we now launch a cluster and assign an EC2 key pair that can allow us to ssh and log in to the box using the provided key pair file: 4. Once the EC2 box is launched, next step is to install relevant software packages.To ensure proper GPU utilization, it is important to ensure graphics drivers are installed first. We will upgrade and install NVIDIA drivers as follows: $ sudo add-apt-repository ppa:graphics-drivers/ppa -y $ sudo apt-get update $ sudo apt-get install -y nvidia-375 nvidia-settings While NVIDIA drivers ensure that host GPU can now be utilized by any deep learning application, it does not provide an easy interface to application developers for easy programming on the device. Various different software libraries exist today that help achieve this task reliably. Open Computing Language (OpenCL) and CUDA are more commonly used in industry. In this book, we use CUDA as an application programming interface for accessing NVIDIA graphics drivers. To install CUDA driver, we first SSH into the EC2 instance and download CUDA 8.0 to our $HOME folder and install from there: $ wget https://developer.nvidia.com/compute/cuda/8.0/Prod2/local_installers/cuda-r epo-ubuntu1604-8-0-local-ga2_8.0.61-1_amd64-deb $ sudo dpkg -i cuda-repo-ubuntu1604-8-0-local_8.0.44-1_amd64-deb $ sudo apt-get update $ sudo apt-get install -y cuda nvidia-cuda-toolkit Once the installation is finished, you can run the following command to validate the installation: $ nvidia-smi Now your EC2 box is fully configured to be used for a deep learning development. However, for someone who is not very familiar with deep learning implementation details, building a deep learning system from scratch can be a daunting task. To ease this development, a number of advanced deep learning software frameworks exist, such as Keras and Theano. Both of these frameworks are based on a Python development environment, hence we first install a Python distribution on the box, such as Anaconda: $ wget https://repo.continuum.io/archive/Anaconda3-4.2.0-Linux-x86_64.sh $ bash Anaconda3-4.2.0-Linux-x86_64.sh Finally, Keras and Theanos are installed using Python’s package manager pip: $ pip install --upgrade --no-deps git+git://github.com/Theano/Theano.git $ pip install keras Once the pip installation is completed successfully, the box is now fully set up for a deep learning development. Setup using Docker The previous section describes getting started from scratch which can be tricky sometimes given continuous changes to software packages and changing links on the web. One way to avoid dependence on links is to use container technology like Docker. In this chapter, we will use the official NVIDIA-Docker image that comes pre-packaged with all the necessary packages and deep learning framework to get you quickly started with deep learning application development: $ sudo add-apt-repository ppa:graphics-drivers/ppa -y $ sudo apt-get update $ sudo apt-get install -y nvidia-375 nvidia-settings nvidia-modprobe We now install Docker Community Edition as follows: $ curl -fsSL https://download.docker.com/linux/ubuntu/gpg | sudo apt-key add - # Verify that the key fingerprint is 9DC8 5822 9FC7 DD38 854A E2D8 8D81 803C 0EBF CD88 $ sudo apt-key fingerprint 0EBFCD88 $ sudo add-apt-repository "deb [arch=amd64] https://download.docker.com/linux/ubuntu $(lsb_release -cs) Stable" $ sudo apt-get update $ sudo apt-get install -y docker-ce 2. We then install NVIDIA-Docker and its plugin: $ wget -P /tmp https://github.com/NVIDIA/nvidia-docker/releases/download/v1.0.1/nv Idia-docker_1.0.1-1_amd64.deb $ sudo dpkg -i /tmp/nvidia-docker_1.0.1-1_amd64.deb && rm /tmp/nvidia-docker_1.0.1-1_amd64.deb 3. To validate if the installation happened correctly, we use the following command: $ sudo nvidia-docker run --rm nvidia/cuda nvidia-smi 4. Once it’s setup correctly, we can use the official TensorFlow or Theano Docker Image: $ sudo nvidia-docker run -it tensorflow/tensorflow:latest-gpu bash 5. We can run a simple Python program to check if TensorFlow works properly: import tensorflow as tf a = tf.constant(5, tf.float32) b = tf.constant(5, tf.float32) with tf.Session() as sess: sess.run(tf.add(a, b)) # output is 10.0 print("Output of graph computation is = ",output) You should see the TensorFlow output on the screen now as shown in figure Tensorflow sample output: We saw how to set up deep learning system on AWS from scratch and on Docker. If you found our post useful, do check out this book Deep Learning Essentials to optimize deep learning models for better performance output.

0
0
13127

article-image-implement-long-short-term-memory-lstm-tensorflow

Gebin George

06 Mar 2018

4 min read

Implement Long-short Term Memory (LSTM) with TensorFlow

Gebin George

06 Mar 2018

4 min read

[box type="note" align="" class="" width=""]This article is an excerpt from the book, Deep Learning Essentials written by Wei Di, Anurag Bhardwaj, and Jianing Wei. This book will help you get started with the essentials of deep learning and neural network modeling.[/box] In today’s tutorial, we will look at an example of using LSTM in TensorFlow to perform sentiment classification. The input to LSTM will be a sentence or sequence of words. The output of LSTM will be a binary value indicating a positive sentiment with 1 and a negative sentiment with 0. We will use a many-to-one LSTM architecture for this problem since it maps multiple inputs onto a single output. Figure LSTM: Basic cell architecture shows this architecture in more detail. As shown here, the input takes a sequence of word tokens (in this case, a sequence of three words). Each word token is input at a new time step and is input to the hidden state for the corresponding time step. For example, the word Book is input at time step t and is fed to the hidden state ht: Sentiment analysis: To implement this model in TensorFlow, we need to first define a few variables as follows: batch_size = 4 lstm_units = 16 num_classes = 2 max_sequence_length = 4 embedding_dimension = 64 num_iterations = 1000 As shown previously, batch_size dictates how many sequences of tokens we can input in one batch for training. lstm_units represents the total number of LSTM cells in the network. max_sequence_length represents the maximum possible length of a given sequence. Once defined, we now proceed to initialize TensorFlow-specific data structures for input data as follows: import tensorflow as tf labels = tf.placeholder(tf.float32, [batch_size, num_classes]) raw_data = tf.placeholder(tf.int32, [batch_size, max_sequence_length]) Given we are working with word tokens, we would like to represent them using a good feature representation technique. Let us assume the word embedding representation takes a word token and projects it onto an embedding space of dimension, embedding_dimension. The two-dimensional input data containing raw word tokens is now transformed into a three-dimensional word tensor with the added dimension representing the word embedding. We also use pre-computed word embedding, stored in a word_vectors data structure. We initialize the data structures as follows: data = tf.Variable(tf.zeros([batch_size, max_sequence_length, embedding_dimension]),dtype=tf.float32) data = tf.nn.embedding_lookup(word_vectors,raw_data) Now that the input data is ready, we look at defining the LSTM model. As shown previously, we need to create lstm_units of a basic LSTM cell. Since we need to perform a classification at the end, we wrap the LSTM unit with a dropout wrapper. To perform a full temporal pass of the data on the defined network, we unroll the LSTM using a dynamic_rnn routine of TensorFlow. We also initialize a random weight matrix and a constant value of 0.1 as the bias vector, as follows: weight = tf.Variable(tf.truncated_normal([lstm_units, num_classes])) bias = tf.Variable(tf.constant(0.1, shape=[num_classes])) lstm_cell = tf.contrib.rnn.BasicLSTMCell(lstm_units) wrapped_lstm_cell = tf.contrib.rnn.DropoutWrapper(cell=lstm_cell, output_keep_prob=0.8) output, state = tf.nn.dynamic_rnn(wrapped_lstm_cell, data, dtype=tf.float32) Once the output is generated by the dynamic unrolled RNN, we transpose its shape, multiply it by the weight vector, and add a bias vector to it to compute the final prediction value: output = tf.transpose(output, [1, 0, 2]) last = tf.gather(output, int(output.get_shape()[0]) - 1) prediction = (tf.matmul(last, weight) + bias) weight = tf.cast(weight, tf.float64) last = tf.cast(last, tf.float64) bias = tf.cast(bias, tf.float64) Since the initial prediction needs to be refined, we define an objective function with crossentropy to minimize the loss as follows: loss = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits (logits=prediction, labels=labels)) optimizer = tf.train.AdamOptimizer().minimize(loss) After this sequence of steps, we have a trained, end-to-end LSTM network for sentiment classification of arbitrary length sentences. To summarize, we saw how effectively we can implement LSTM network using TensorFlow. If you are interested to know more, check out this book Deep Learning Essentials which will help you take first steps in training efficient deep learning models and apply them in various practical scenarios.

0
0
16484

article-image-logistic-regression-using-tensorflow

Packt

06 Mar 2018

9 min read

Logistic Regression Using TensorFlow

Packt

06 Mar 2018

9 min read

In this article, by PKS Prakash and Achyutuni Sri Krishna Rao, authors of R Deep Learning Cookbook we will learn how to Perform logistic regression using TensorFlow. In this recipe, we will cover the application of TensorFlow in setting up a logistic regression model. The example will use a similar dataset to that used in the H2O model setup. (For more resources related to this topic, see here.) What is TensorFlow TensorFlow is another open source library developed by the Google Brain Team to build numerical computation models using data flow graphs. The core of TensorFlow was developed in C++ with the wrapper in Python. The tensorflow package in R gives you access to the TensorFlow API composed of Python modules to execute computation models. TensorFlow supports both CPU- and GPU-based computations. The tensorflow package in R calls the Python tensorflow API for execution, which is essential to install the tensorflow package in both R and Python to make R work. The following are the dependencies for tensorflow: Python 2.7 / 3.x R (>3.2) devtools package in R for installing TensorFlow from GitHub TensorFlow in Python pip Getting ready The code for this section is created on Linux but can be run on any operating system. To start modeling, load the tensorflow package in the environment. R loads the default TensorFlow environment variable and also the NumPy library from Python in the np variable: library("tensorflow") # Load TensorFlow np <- import("numpy") # Load numpy library How to do it... The data is imported using a standard function from R, as shown in the following code. The data is imported using the read.csv file and transformed into the matrix format followed by selecting the features used to model as defined in xFeatures and yFeatures. The next step in TensorFlow is to set up a graph to run optimization: # Loading input and test data xFeatures = c("Temperature", "Humidity", "Light", "CO2", "HumidityRatio") yFeatures = "Occupancy" occupancy_train <- as.matrix(read.csv("datatraining.txt",stringsAsFactors = T)) occupancy_test <- as.matrix(read.csv("datatest.txt",stringsAsFactors = T)) # subset features for modeling and transform to numeric values occupancy_train<-apply(occupancy_train[, c(xFeatures, yFeatures)], 2, FUN=as.numeric) occupancy_test<-apply(occupancy_test[, c(xFeatures, yFeatures)], 2, FUN=as.numeric) # Data dimensions nFeatures<-length(xFeatures) nRow<-nrow(occupancy_train) Before setting up the graph, let's reset the graph using the following command: # Reset the graph tf$reset_default_graph() Additionally, let's start an interactive session as it will allow us to execute variables without referring to the session-to-session object: # Starting session as interactive session sess<-tf$InteractiveSession() Define the logistic regression model in TensorFlow: # Setting-up Logistic regression graph x <- tf$constant(unlist(occupancy_train[, xFeatures]), shape=c(nRow, nFeatures), dtype=np$float32) # W <- tf$Variable(tf$random_uniform(shape(nFeatures, 1L))) b <- tf$Variable(tf$zeros(shape(1L))) y <- tf$matmul(x, W) + b The input feature x is defined as a constant as it will be an input to the system. The weight W and bias b are defined as variables that will be optimized during the optimization process. The y is set up as a symbolic representation between x, W, and b. The weight W is set up to initialize random uniform distribution and b is assigned the value zero. The next step is to set up the cost function for logistic regression: # Setting-up cost function and optimizer y_ <- tf$constant(unlist(occupancy_train[, yFeatures]), dtype="float32", shape=c(nRow, 1L)) cross_entropy<- tf$reduce_mean(tf$nn$sigmoid_cross_entropy_with_logits(labe ls=y_, logits=y, name="cross_entropy")) optimizer <- tf$train$GradientDescentOptimizer(0.15)$minimize(cross_entr opy) # Start a session init <- tf$global_variables_initializer() sess$run(init) Execute the gradient descent algorithm for the optimization of weights using cross entropy as the loss function: # Running optimization for (step in 1:5000) { sess$run(optimizer) if (step %% 20== 0) cat(step, "-", sess$run(W), sess$run(b), "==>", sess$run(cross_entropy), "n") } How it works... The performance of the model can be evaluated using AUC: # Performance on Train library(pROC) ypred <- sess$run(tf$nn$sigmoid(tf$matmul(x, W) + b)) roc_obj <- roc(occupancy_train[, yFeatures], as.numeric(ypred)) # Performance on test nRowt<-nrow(occupancy_test) xt <- tf$constant(unlist(occupancy_test[, xFeatures]), shape=c(nRowt, nFeatures), dtype=np$float32) ypredt <- sess$run(tf$nn$sigmoid(tf$matmul(xt, W) + b)) roc_objt <- roc(occupancy_test[, yFeatures], as.numeric(ypredt)). AUC can be visualized using the plot.auc function from the pROC package, as shown in the screenshot following this command. The performance for training and testing (holdout) is very similar. plot.roc(roc_obj, col = "green", lty=2, lwd=2) plot.roc(roc_objt, add=T, col="red", lty=4, lwd=2) Performance of logistic regression using TensorFlow Visualizing TensorFlow graphs TensorFlow graphs can be visualized using TensorBoard. It is a service that utilizes TensorFlow event files to visualize TensorFlow models as graphs. Graph model visualization in TensorBoard is also used to debug TensorFlow models. Getting ready TensorBoard can be started using the following command in the terminal: $ tensorboard --logdir home/log --port 6006 The following are the major parameters for TensorBoard: --logdir : To map to the directory to load TensorFlow events --debug: To increase log verbosity --host: To define the host to listen to its localhost (127.0.0.1) by default --port: To define the port to which TensorBoard will serve The preceding command will launch the TensorFlow service on localhost at port 6006, as shown in the following screenshot: TensorBoard The tabs on the TensorBoard capture relevant data generated during graph execution. How to do it... The section covers how to visualize TensorFlow models and output in TernsorBoard. To visualize summaries and graphs, data from TensorFlow can be exported using the FileWriter command from the summary module. A default session graph can be added using the following command: # Create Writer Obj for log log_writer = tf$summary$FileWriter('c:/log', sess$graph) The graph for logistic regression developed using the preceding code is shown in the following screenshot: Visualization of the logistic regression graph in TensorBoard Similarly, other variable summaries can be added to the TensorBoard using correct summaries, as shown in the following code: # Adding histogram summary to weight and bias variable w_hist = tf$histogram_summary("weights", W) b_hist = tf$histogram_summary("biases", b) Create a cross entropy evaluation for test. An example script to generate the cross entropy cost function for test and train is shown in the following command: # Set-up cross entropy for test nRowt<-nrow(occupancy_test) xt <- tf$constant(unlist(occupancy_test[, xFeatures]), shape=c(nRowt, nFeatures), dtype=np$float32) ypredt <- tf$nn$sigmoid(tf$matmul(xt, W) + b) yt_ <- tf$constant(unlist(occupancy_test[, yFeatures]), dtype="float32", shape=c(nRowt, 1L)) cross_entropy_tst<- tf$reduce_mean(tf$nn$sigmoid_cross_entropy_with_logits(labe ls=yt_, logits=ypredt, name="cross_entropy_tst")) Add summary variables to be collected: # Add summary ops to collect data w_hist = tf$summary$histogram("weights", W) b_hist = tf$summary$histogram("biases", b) crossEntropySummary<-tf$summary$scalar("costFunction", cross_entropy) crossEntropyTstSummary<- tf$summary$scalar("costFunction_test", cross_entropy_tst) Open the writing object, log_writer. It writes the default graph to the location, c:/log: # Create Writer Obj for log log_writer = tf$summary$FileWriter('c:/log', sess$graph) Run the optimization and collect the summaries: for (step in 1:2500) { sess$run(optimizer) # Evaluate performance on training and test data after 50 Iteration if (step %% 50== 0){ ### Performance on Train ypred <- sess$run(tf$nn$sigmoid(tf$matmul(x, W) + b)) roc_obj <- roc(occupancy_train[, yFeatures], as.numeric(ypred)) ### Performance on Test ypredt <- sess$run(tf$nn$sigmoid(tf$matmul(xt, W) + b)) roc_objt <- roc(occupancy_test[, yFeatures], as.numeric(ypredt)) cat("train AUC: ", auc(roc_obj), " Test AUC: ", auc(roc_objt), "n") # Save summary of Bias and weights log_writer$add_summary(sess$run(b_hist), global_step=step) log_writer$add_summary(sess$run(w_hist), global_step=step) log_writer$add_summary(sess$run(crossEntropySummary), global_step=step) log_writer$add_summary(sess$run(crossEntropyTstSummary), global_step=step) } } Collect all the summaries to a single tensor using the merge_all command from the summary module: summary = tf$summary$merge_all() Write the summaries to the log file using the log_writer object: log_writer = tf$summary$FileWriter('c:/log', sess$graph) summary_str = sess$run(summary) log_writer$add_summary(summary_str, step) log_writer$close() Summary In this article, we have learned how to perform logistic regression using TensorFlow also we have covered the application of TensorFlow in setting up a logistic regression model. Resources for Article: Further resources on this subject: [article] [article] [article]

0
0
4403

article-image-implement-reinforcement-learning-tensorflow

Gebin George

05 Mar 2018

3 min read

How to implement Reinforcement Learning with TensorFlow

Gebin George

05 Mar 2018

3 min read

[box type="note" align="" class="" width=""]This article is an excerpt from the book, Deep Learning Essentials co-authored by Wei Di, Anurag Bhardwaj, and Jianing Wei. This book will help you get to grips with the essentials of deep learning by leveraging the power of Python.[/box] In today’s tutorial, we will implement reinforcement learning with TensorFlow-based Qlearning algorithm. We will look at a popular game, FrozenLake, which has an inbuilt environment in the OpenAI gym package. The idea behind the FrozenLake game is quite simple. It consists of 4 x 4 grid blocks, where each block can have one of the following four states: S: Starting point/Safe state F: Frozen surface/Safe state H: Hole/Unsafe state G: Goal/Safe or Terminal state In each of the 16 cells, you can use one of the four actions, namely up/down/left/right, to move to a neighboring state. The goal of the game is to start from state S and end at state G. We will show how we can use a neural network-based Q-learning system to learn a safe path from state S to state G. First, we import the necessary packages and define the game environment: import gym import numpy as np import random import tensorflow as tf env = gym.make('FrozenLake-v0') Once the environment is defined, we can define the network structure that learns the Qvalues. We will use a one-layer neural network with 16 hidden neurons and 4 output neurons as follows: input_matrix = tf.placeholder(shape=[1,16],dtype=tf.float32) weight_matrix = tf.Variable(tf.random_uniform([16,4],0,0.01)) Q_matrix = tf.matmul(input_matrix,weight_matrix) prediction_matrix = tf.argmax(Q_matrix,1) nextQ = tf.placeholder(shape=[1,4],dtype=tf.float32) loss = tf.reduce_sum(tf.square(nextQ - Q_matrix)) train = tf.train.GradientDescentOptimizer(learning_rate=0.05) model = train.minimize(loss) init_op = tf.global_variables_initializer() Now we can choose the action greedily: ip_q = np.zeros(num_states) ip_q[current_state] = 1 a,allQ = sess.run([prediction_matrix,Q_matrix],feed_dict={input_matrix: [ip_q]}) if np.random.rand(1) < sample_epsilon: a[0] = env.action_space.sample() next_state, reward, done, info = env.step(a[0]) ip_q1 = np.zeros(num_states) ip_q1[next_state] = 1 Q1 = sess.run(Q_matrix,feed_dict={input_matrix:[ip_q1]}) maxQ1 = np.max(Q1) targetQ = allQ targetQ[0,a[0]] = reward + y*maxQ1 _,W1 = sess.run([model,weight_matrix],feed_dict={input_matrix: [ip_q],nextQ:targetQ}) Figure RL with Q-learning example shows the sample output of the program when executed. You can see different values of Q matrix as the agent moves from one state to the other. You also notice a value of reward 1 when the agent is in state 15: To summarize, we saw how reinforcement learning can be practically implemented using TensorFlow. If you found this post useful, do check out the book Deep Learning Essentials which will help you fine-tune and optimize your deep learning models for better performance.

0
0
17755

article-image-how-to-compute-interpolation-in-scipy

Pravin Dhandre

05 Mar 2018

8 min read

How to Compute Interpolation in SciPy

Pravin Dhandre

05 Mar 2018

8 min read

[box type="note" align="" class="" width=""]This article is an excerpt from a book co-authored by L. Felipe Martins, Ruben Oliva Ramos and V Kishore Ayyadevara titled SciPy Recipes. This book provides numerous recipes in mastering common tasks related to SciPy and associated libraries such as NumPy, pandas, and matplotlib.[/box] In today’s tutorial, we will see how to compute and solve polynomial, univariate interpolations using SciPy with detailed process and instructions. In this recipe, we will look at how to compute data polynomial interpolation by applying some important methods which are discussed in detail in the coming How to do it... section. Getting ready We will need to follow some instructions and install the prerequisites. How to do it… Let's get started. In the following steps, we will explain how to compute a polynomial interpolation and the things we need to know: They require the following parameters: points: An ndarray of floats, shape (n, D) data point coordinates. It can be either an array of shape (n, D) or a tuple of ndim arrays. values: An ndarray of float or complex shape (n,) data values. xi: A 2D ndarray of float or tuple of 1D array, shape (M, D). Points at which to interpolate data. method: A {'linear', 'nearest', 'cubic'}—This is an optional method of interpolation. One of the nearest return value is at the data point closest to the point of interpolation. See NearestNDInterpolator for more details. linear tessellates the input point set to n-dimensional simplices, and interpolates linearly on each simplex. See LinearNDInterpolator for more details. cubic (1D): Returns the value determined from a cubic spline. cubic (2D): Returns the value determined from a piecewise cubic, continuously differentiable (C1), and approximately curvature-minimizing polynomial surface. See CloughTocher2DInterpolator for more details. fill_value: float; optional. It is the value used to fill in for requested points outside of the convex hull of the input points. If it is not provided, then the default is nan. This option has no effect on the nearest method. rescale: bool; optional. Rescale points to the unit cube before performing interpolation. This is useful if some of the input dimensions have non-commensurable units and differ by many orders of magnitude. How it works… One can see that the exact result is reproduced by all of the methods to some degree, but for this smooth function, the piecewise cubic interpolant gives the best results: import matplotlib.pyplot as plt import numpy as np methods = [None, 'none', 'nearest', 'bilinear', 'bicubic', 'spline16', 'spline36', 'hanning', 'hamming', 'hermite', 'kaiser', 'quadric', 'catrom', 'gaussian', 'bessel', 'mitchell', 'sinc', 'lanczos'] # Fixing random state for reproducibility np.random.seed(19680801) grid = np.random.rand(4, 4) fig, axes = plt.subplots(3, 6, figsize=(12, 6), subplot_kw={'xticks': [], 'yticks': []}) fig.subplots_adjust(hspace=0.3, wspace=0.05) for ax, interp_method in zip(axes.flat, methods): ax.imshow(grid, interpolation=interp_method, cmap='viridis') ax.set_title(interp_method) plt.show() This is the result of the execution: Univariate interpolation In the next section, we will look at how to solve univariate interpolation. Getting ready We will need to follow some instructions and install the prerequisites. How to do it… The following table summarizes the different univariate interpolation modes coded in SciPy, together with the processes that we may use to resolve them: Finding a cubic spline that interpolates a set of data In this recipe, we will look at how to find a cubic spline that interpolates with the main method of spline. Getting ready We will need to follow some instructions and install the prerequisites. How to do it… We can use the following functions to solve the problems with this parameter: x: array_like, shape (n,). A 1D array containing values of the independent variable. The values must be real, finite, and in strictly increasing order. y: array_like. An array containing values of the dependent variable. It can have an arbitrary number of dimensions, but the length along axis must match the length of x. The values must be finite. axis: int; optional. The axis along which y is assumed to be varying, meaning for x[i], the corresponding values are np.take(y, i, axis=axis). The default is 0. bc_type: String or two-tuple; optional. Boundary condition type. Two additional equations, given by the boundary conditions, are required to determine all coefficients of polynomials on each segment. Refer to: https://docs.scipy.org/doc/scipy-0.19.1/reference/generated/scipy.interpolate.CubicSpline.html#r59. If bc_type is a string, then the specified condition will be applied at both ends of a spline. The available conditions are: not-a-knot (default): The first and second segment at a curve end are the same polynomial. This is a good default when there is no information about boundary conditions. periodic: The interpolated function is assumed to be periodic in the period x[-1] - x[0]. The first and last value of y must be identical: y[0] == y[-1]. This boundary condition will result in y'[0] == y'[-1] and y''[0] == y''[-1]. clamped: The first derivatives at the curve ends are zero. Assuming there is a 1D y, bc_type=((1, 0.0), (1, 0.0)) is the same condition. natural: The second derivatives at the curve ends are zero. Assuming there is a 1D y, bc_type=((2, 0.0), (2, 0.0)) is the same condition. If bc_type is two-tuple, the first and the second value will be applied at the curve's start and end respectively. The tuple value can be one of the previously mentioned strings (except periodic) or a tuple (order, deriv_values), allowing us to specify arbitrary derivatives at curve ends: order: The derivative order; it is 1 or 2. deriv_value: An array_like containing derivative values. The shape must be the same as y, excluding the axis dimension. For example, if y is 1D, then deriv_value must be a scalar. If y is 3D with shape (n0, n1, n2) and axis=2, then deriv_value must be 2D and have the shape (n0, n1). extrapolate: {bool, 'periodic', None}; optional. bool, determines whether or not to extrapolate to out-of-bounds points based on first and last intervals, or to return NaNs. periodic, periodic extrapolation is used. If none (default), extrapolate is set to periodic for bc_type='periodic' and to True otherwise. How it works... We have the following example: %pylab inline from scipy.interpolate import CubicSpline import matplotlib.pyplot as plt x = np.arange(10) y = np.sin(x) cs = CubicSpline(x, y) xs = np.arange(-0.5, 9.6, 0.1) plt.figure(figsize=(6.5, 4)) plt.plot(x, y, 'o', label='data') plt.plot(xs, np.sin(xs), label='true') plt.plot(xs, cs(xs), label="S") plt.plot(xs, cs(xs, 1), label="S'") plt.plot(xs, cs(xs, 2), label="S''") plt.plot(xs, cs(xs, 3), label="S'''") plt.xlim(-0.5, 9.5) plt.legend(loc='lower left', ncol=2) plt.show() We can see the result here: We see the next example: theta = 2 * np.pi * np.linspace(0, 1, 5) y = np.c_[np.cos(theta), np.sin(theta)] cs = CubicSpline(theta, y, bc_type='periodic') print("ds/dx={:.1f} ds/dy={:.1f}".format(cs(0, 1)[0], cs(0, 1)[1])) x=0.0 ds/dy=1.0 xs = 2 * np.pi * np.linspace(0, 1, 100) plt.figure(figsize=(6.5, 4)) plt.plot(y[:, 0], y[:, 1], 'o', label='data') plt.plot(np.cos(xs), np.sin(xs), label='true') plt.plot(cs(xs)[:, 0], cs(xs)[:, 1], label='spline') plt.axes().set_aspect('equal') plt.legend(loc='center') plt.show() In the following screenshot, we can see the final result: Defining a B-spline for a given set of control points In the next section, we will look at how to solve B-splines given some controlled data. Getting ready We need to follow some instructions and install the prerequisites. How to do it… Univariate the spline in the B-spline basis Execute the following: S(x)=∑j=0n-1cjBj,k;t(x)S(x)=∑j=0n-1cjBj,k;t(x) Where it's Bj,k;tBj,k;t are B-spline basis functions of degree k and knots t We can use the following parameters: How it works ... Here, we construct a quadratic spline function on the base interval 2 <= x <= 4 and compare it with the naive way of evaluating the spline: from scipy import interpolate import numpy as np import matplotlib.pyplot as plt # sampling x = np.linspace(0, 10, 10) y = np.sin(x) # spline trough all the sampled points tck = interpolate.splrep(x, y) x2 = np.linspace(0, 10, 200) y2 = interpolate.splev(x2, tck) # spline with all the middle points as knots (not working yet) # knots = x[1:-1] # it should be something like this knots = np.array([x[1]]) # not working with above line and just seeing what this line does weights = np.concatenate(([1],np.ones(x.shape[0]-2)*.01,[1])) tck = interpolate.splrep(x, y, t=knots, w=weights) x3 = np.linspace(0, 10, 200) y3 = interpolate.splev(x2, tck) # plot plt.plot(x, y, 'go', x2, y2, 'b', x3, y3,'r') plt.show() Note that outside of the base interval, results differ. This is because BSpline extrapolates the first and last polynomial pieces of B-spline functions active on the base interval. This is the result of solving the problem: We successfully compute numerical computation and find interpolation function using polynomial and univariate interpolation coded in SciPy. If you found this tutorial useful, do check out the book SciPy Recipes to get quick recipes for performing other mathematical operations like differential equation, K-means and Discrete Fourier Transform.

0
0
11456

article-image-compute-discrete-fourier-transform-dft-using-scipy

Pravin Dhandre

02 Mar 2018

5 min read

How to compute Discrete Fourier Transform (DFT) using SciPy

Pravin Dhandre

02 Mar 2018

5 min read

[box type="note" align="" class="" width=""]This article is an excerpt from a book co-authored by L. Felipe Martins, Ruben Oliva Ramos and V Kishore Ayyadevara titled SciPy Recipes. This book provides numerous recipes to tackle day-to-day challenges associated with scientific computing and data manipulation using SciPy stack.[/box] Today, we will compute Discrete Fourier Transform (DFT) and inverse DFT using SciPy stack. In this article, we will focus majorly on the syntax and the application of DFT in SciPy assuming you are well versed with the mathematics of this concept. Discrete Fourier Transforms A discrete Fourier transform transforms any signal from its time/space domain into a related signal in frequency domain. This allows us to not only analyze the different frequencies of the data, but also enables faster filtering operations, when used properly. It is possible to turn a signal in a frequency domain back to its time/spatial domain, thanks to inverse Fourier transform (IFT). How to do it… To follow with the example, we need to continue with the following steps: The basic routines in the scipy.fftpack module compute the DFT and its inverse, for discrete signals in any dimension—fft, ifft (one dimension), fft2, ifft2 (two dimensions), and fftn, ifftn (any number of dimensions). Verify all these routines assume that the data is complex valued. If we know beforehand that a particular dataset is actually real-valued, and should offer realvalued frequencies, we use rfft and irfft instead, for a faster algorithm. In order to complete with this, these routines are designed so that composition with their inverses always yields the identity. The syntax is the same in all cases, as follows: fft(x[, n, axis, overwrite_x]) The first parameter, x, is always the signal in any array-like form. Note that fft performs one-dimensional transforms. This means that if x happens to be two-dimensional, for example, fft will output another two-dimensional array, where each row is the transform of each row of the original. We can use columns instead, with the optional axis parameter. The rest of the parameters are also optional; n indicates the length of the transform and overwrite_x gets rid of the original data to save memory and resources. We usually play with the n integer when we need to pad the signal with zeros or truncate it. For a higher dimension, n is substituted by shape (a tuple) and axis by axes (another tuple). To better understand the output, it is often useful to shift the zero frequencies to the center of the output arrays with ifftshift. The inverse of this operation, ifftshift, is also included in the module. How it works… The following code shows some of these routines in action when applied to a checkerboard: import numpy from scipy.fftpack import fft,fft2, fftshift import matplotlib.pyplot as plt B=numpy.ones((4,4)); W=numpy.zeros((4,4)) signal = numpy.bmat("B,W;W,B") onedimfft = fft(signal,n=16) twodimfft = fft2(signal,shape=(16,16)) plt.figure() plt.gray() plt.subplot(121,aspect='equal') plt.pcolormesh(onedimfft.real) plt.colorbar(orientation='horizontal') plt.subplot(122,aspect='equal') plt.pcolormesh(fftshift(twodimfft.real)) plt.colorbar(orientation='horizontal') plt.show() Note how the first four rows of the one-dimensional transform are equal (and so are the last four), while the two-dimensional transform (once shifted) presents a peak at the origin and nice symmetries in the frequency domain. In the following screenshot, which has been obtained from the previous code, the image on the left is the fft and the one on the right is the fft2 of a 2 x 2 checkerboard signal: Computing the discrete Fourier transform (DFT) of a data series using the FFT Algorithm In this section, we will see how to compute the discrete Fourier transform and some of its Applications. How to do it… In the following table, we will see the parameters to create a data series using the FFT algorithm: How it works… This code represents computing an FFT discrete Fourier in the main part: np.fft.fft(np.exp(2j * np.pi * np.arange(8) / 8)) array([ -3.44505240e-16 +1.14383329e-17j, 8.00000000e+00 -5.71092652e-15j, 2.33482938e-16 +1.22460635e-16j, 1.64863782e-15 +1.77635684e-15j, 9.95839695e-17 +2.33482938e-16j, 0.00000000e+00 +1.66837030e-15j, 1.14383329e-17 +1.22460635e-16j, -1.64863782e-15 +1.77635684e-15j]) In this example, real input has an FFT that is Hermitian, that is, symmetric in the real part and anti-symmetric in the imaginary part, as described in the numpy.fft documentation. import matplotlib.pyplot as plt t = np.arange(256) sp = np.fft.fft(np.sin(t)) freq = np.fft.fftfreq(t.shape[-1]) plt.plot(freq, sp.real, freq, sp.imag) [<matplotlib.lines.Line2D object at 0x...>, <matplotlib.lines.Line2D object at 0x...>] plt.show() The following screenshot shows how we represent the results: Computing the inverse DFT of a data series In this section, we will learn how to compute the inverse DFT of a data series. How to do it… In this section we will see how to compute the inverse Fourier transform. The returned complex array contains y(0), y(1),..., y(n-1) where: How it works… In this part, we represent the calculous of the DFT: np.fft.ifft([0, 4, 0, 0]) array([ 1.+0.j, 0.+1.j, -1.+0.j, 0.-1.j]) Create and plot a band-limited signal with random phases: import matplotlib.pyplot as plt t = np.arange(400) n = np.zeros((400,), dtype=complex) n[40:60] = np.exp(1j*np.random.uniform(0, 2*np.pi, (20,))) s = np.fft.ifft(n) plt.plot(t, s.real, 'b-', t, s.imag, 'r--') plt.legend(('real', 'imaginary')) plt.show() Then we represent it, as shown in the following screenshot: We successfully explored how to transform signals from time or space domain into frequency domain and vice-versa, allowing you to analyze frequencies in detail. If you found this tutorial useful, do check out the book SciPy Recipes to get hands-on recipes to perform various data science tasks with ease.

0
1
40087

article-image-how-to-use-mapreduce-with-mongo-shell

Amey Varangaonkar

02 Mar 2018

8 min read

How to use MapReduce with Mongo shell

Amey Varangaonkar

02 Mar 2018

8 min read

[box type="note" align="" class="" width=""]The following excerpt is taken from the book Mastering MongoDB 3.x authored by Alex Giamas. This book demonstrates the power of MongoDB to build high performance database solutions with ease.[/box] MongoDB is one of the most popular NoSQL databases in the world and can be combined with various Big Data tools for efficient data processing. In this article we explore interesting features of MongoDB, which has been underappreciated and not widely supported throughout the industry as yet - the ability to write MapReduce natively using shell. MapReduce is a data processing method for getting aggregate results from a large set of data. The main advantage is that it is inherently parallelizable as evidenced by frameworks such as Hadoop. A simple example of MapReduce would be as follows, given that our input books collection is as follows: > db.books.find() { "_id" : ObjectId("592149c4aabac953a3a1e31e"), "isbn" : "101", "name" : "Mastering MongoDB", "price" : 30 } { "_id" : ObjectId("59214bc1aabac954263b24e0"), "isbn" : "102", "name" : "MongoDB in 7 years", "price" : 50 } { "_id" : ObjectId("59214bc1aabac954263b24e1"), "isbn" : "103", "name" : "MongoDB for experts", "price" : 40 } And our map and reduce functions are defined as follows: > var mapper = function() { emit(this.id, 1); }; In this mapper, we simply output a key of the id of each document with a value of 1: > var reducer = function(id, count) { return Array.sum(count); }; In the reducer, we sum across all values (where each one has a value of 1): > db.books.mapReduce(mapper, reducer, { out:"books_count" }); { "result" : "books_count", "timeMillis" : 16613, "counts" : { "input" : 3, "emit" : 3, "reduce" : 1, "output" : 1 }, "ok" : 1 } > db.books_count.find() { "_id" : null, "value" : 3 } > Our final output is a document with no ID, since we didn't output any value for id, and a value of 6, since there are six documents in the input dataset. Using MapReduce, MongoDB will apply map to each input document, emitting key-value pairs at the end of the map phase. Then each reducer will get key-value pairs with the same key as input, processing all multiple values. The reducer's output will be a single key-value pair for each key. Optionally, we can use a finalize function to further process the results of the mapper and reducer. MapReduce functions use JavaScript and run within the mongod process. MapReduce can output inline as a single document, subject to the 16 MB document size limit, or as multiple documents in an output collection. Input and output collections can be sharded. MapReduce concurrency MapReduce operations will place several short-lived locks that should not affect operations. However, at the end of the reduce phase, if we are outputting data to an existing collection, then output actions such as merge, reduce, and replace will take an exclusive global write lock for the whole server, blocking all other writes in the db instance. If we want to avoid that we should invoke MapReduce in the following way: > db.collection.mapReduce( Mapper, Reducer, { out: { merge/reduce: bookOrders, nonAtomic: true } }) We can apply nonAtomic only to merge or reduce actions. replace will just replace the contents of documents in bookOrders, which would not take much time anyway. With the merge action, the new result is merged with the existing result if the output collection already exists. If an existing document has the same key as the new result, then it will overwrite that existing document. With the reduce action, the new result is processed together with the existing result if the output collection already exists. If an existing document has the same key as the new result, it will apply the reduce function to both the new and the existing documents and overwrite the existing document with the result. Although MapReduce has been present since the early versions of MongoDB, it hasn't evolved as much as the rest of the database, resulting in its usage being less than that of specialized MapReduce frameworks such as Hadoop. Incremental MapReduce Incremental MapReduce is a pattern where we use MapReduce to aggregate to previously calculated values. An example would be counting non-distinct users in a collection for different reporting periods (that is, hour, day, month) without the need to recalculate the result every hour. To set up our data for incremental MapReduce we need to do the following: Output our reduce data to a different collection At the end of every hour, query only for the data that got into the collection in the last hour With the output of our reduce data, merge our results with the calculated results from the previous hour Following up on the previous example, let's assume that we have a published field in each of the documents, with our input dataset being: > db.books.find() { "_id" : ObjectId("592149c4aabac953a3a1e31e"), "isbn" : "101", "name" : "Mastering MongoDB", "price" : 30, "published" : ISODate("2017-06-25T00:00:00Z") } { "_id" : ObjectId("59214bc1aabac954263b24e0"), "isbn" : "102", "name" : "MongoDB in 7 years", "price" : 50, "published" : ISODate("2017-06-26T00:00:00Z") } Using our previous example of counting books we would get the following: var mapper = function() { emit(this.id, 1); }; var reducer = function(id, count) { return Array.sum(count); }; > db.books.mapReduce(mapper, reducer, { out: "books_count" }) { "result" : "books_count", "timeMillis" : 16700, "counts" : { "input" : 2, "emit" : 2, "reduce" : 1, "output" : 1 }, "ok" : 1 } > db.books_count.find() { "_id" : null, "value" : 2 } Now we get a third book in our mongo_books collection with a document: { "_id" : ObjectId("59214bc1aabac954263b24e1"), "isbn" : "103", "name" : "MongoDB for experts", "price" : 40, "published" : ISODate("2017-07-01T00:00:00Z") } > db.books.mapReduce( mapper, reducer, { query: { published: { $gte: ISODate('2017-07-01 00:00:00') } }, out: { reduce: "books_count" } } ) > db.books_count.find() { "_id" : null, "value" : 3 } What happened here, is that by querying for documents in July 2017 we only got the new document out of the query and then used its value to reduce the value with the already calculated value of 2 in our books_count document, adding 1 to the final sum of three documents. This example, as contrived as it is, shows a powerful attribute of MapReduce: the ability to re-reduce results to incrementally calculate aggregations over time. Troubleshooting MapReduce Throughout the years, one of the major shortcomings of MapReduce frameworks has been the inherent difficulty in troubleshooting as opposed to simpler non-distributed patterns. Most of the time, the most effective tool is debugging using log statements to verify that output values match our expected values. In the mongo shell, this being a JavaScript shell, this is as simple as outputting using the console.log()function. Diving deeper into MapReduce in MongoDB we can debug both in the map and the reduce phase by overloading the output values. Debugging the mapper phase, we can overload the emit() function to test what the output key values are: > var emit = function(key, value) { print("debugging mapper's emit"); print("key: " + key + " value: " + tojson(value)); } We can then call it manually on a single document to verify that we get back the key-value pair that we would expect: > var myDoc = db.orders.findOne( { _id: ObjectId("50a8240b927d5d8b5891743c") } ); > mapper.apply(myDoc); The reducer function is somewhat more complicated. A MapReduce reducer function must meet the following criteria: It must be idempotent The order of values coming from the mapper function should not matter for the reducer's result The reduce function must return the same type of result as the mapper function We will dissect these following requirements to understand what they really mean: It must be idempotent: MapReduce by design may call the reducer multiple times for the same key with multiple values from the mapper phase. It also doesn't need to reduce single instances of a key as it's just added to the set. The final value should be the same no matter the order of execution. This can be verified by writing our own "verifier" function forcing the reducer to re-reduce or by executing the reducer many, many times: reduce( key, [ reduce(key, valuesArray) ] ) == reduce( key, valuesArray ) It must be commutative: Again, because multiple invocations of the reducer may happen for the same key, if it has multiple values, the following should hold: reduce(key, [ C, reduce(key, [ A, B ]) ] ) == reduce( key, [C, A, B ] ) The order of values coming from the mapper function should not matter for the reducer's result: We can test that the order of values from the mapper doesn't change the output for the reducer by passing in documents to the mapper in a different order and verifying that we get the same results out: reduce( key, [ A, B ] ) == reduce( key, [ B, A ] ) The reduce function must return the same type of result as the mapper function: Hand-in-hand with the first requirement, the type of object that the reduce function returns should be the same as the output of the mapper function. We saw how MapReduce is useful when implemented on a data pipeline. Multiple MapReduce commands can be chained to produce different results. An example would be aggregating data by different reporting periods (hour, day, week, month, year) where we use the output of each more granular reporting period to produce a less granular report. If you found this article useful, make sure to check our book Mastering MongoDB 3.x to get more insights and information about MongoDB’s vast data storage, management and administration capabilities.

0
0
10940

article-image-implementing-apache-spark-k-means-clustering-method-on-digital-breath-test-data-for-road-safety

Savia Lobo

01 Mar 2018

7 min read

Implementing Apache Spark K-Means Clustering method on digital breath test data for road safety

Savia Lobo

01 Mar 2018

7 min read

[box type="note" align="" class="" width=""]This article is an excerpt taken from a book Mastering Apache Spark 2.x - Second Edition written by Romeo Kienzler. In this book, you will learn to use Spark as a big data operating system, understand how to implement advanced analytics on the new APIs, and explore how easy it is to use Spark in day-to-day tasks.[/box] In today’s tutorial, we have used the Road Safety test data from our previous article, to show how one can attempt to find clusters in data using K-Means algorithm with Apache Spark MLlib. Theory on Clustering The K-Means algorithm iteratively attempts to determine clusters within the test data by minimizing the distance between the mean value of cluster center vectors, and the new candidate cluster member vectors. The following equation assumes dataset members that range from X1 to Xn; it also assumes K cluster sets that range from S1 to Sk, where K <= n. K-Means in practice The K-Means MLlib functionality uses the LabeledPoint structure to process its data and so it needs numeric input data. As the same data from the last section is being reused, we will not explain the data conversion again. The only change that has been made in data terms in this section, is that processing in HDFS will now take place under the /data/spark/kmeans/ directory. Additionally, the conversion Scala script for the K-Means example produces a record that is all comma-separated. The development and processing for the K-Means example has taken place under the /home/hadoop/spark/kmeans directory to separate the work from other development. The sbt configuration file is now called kmeans.sbt and is identical to the last example, except for the project name: name := "K-Means" The code for this section can be found in the software package under chapter7K-Means. So, looking at the code for kmeans1.scala, which is stored under kmeans/src/main/scala, some similar actions occur. The import statements refer to the Spark context and configuration. This time, however, the K-Means functionality is being imported from MLlib. Additionally, the application class name has been changed for this example to kmeans1: import org.apache.spark.SparkContext import org.apache.spark.SparkContext._ import org.apache.spark.SparkConf import org.apache.spark.mllib.linalg.Vectors import org.apache.spark.mllib.clustering.{KMeans,KMeansModel} object kmeans1 extends App { The same actions are being taken as in the last example to define the data file--to define the Spark configuration and create a Spark context: val hdfsServer = "hdfs://localhost:8020" val hdfsPath = "/data/spark/kmeans/" val dataFile = hdfsServer + hdfsPath + "DigitalBreathTestData2013- MALE2a.csv" val sparkMaster = "spark://localhost:7077" val appName = "K-Means 1" val conf = new SparkConf() conf.setMaster(sparkMaster) conf.setAppName(appName) val sparkCxt = new SparkContext(conf) Next, the CSV data is loaded from the data file and split by comma characters into the VectorData variable: val csvData = sparkCxt.textFile(dataFile) val VectorData = csvData.map { csvLine => Vectors.dense( csvLine.split(',').map(_.toDouble)) } A KMeans object is initialized, and the parameters are set to define the number of clusters and the maximum number of iterations to determine them: val kMeans = new KMeans val numClusters = 3 val maxIterations = 50 Some default values are defined for the initialization mode, number of runs, and Epsilon, which we needed for the K-Means call but did not vary for the processing. Finally, these parameters were set against the KMeans object: val initializationMode = KMeans.K_MEANS_PARALLEL val numRuns = 1 val numEpsilon = 1e-4 kMeans.setK( numClusters ) kMeans.setMaxIterations( maxIterations ) kMeans.setInitializationMode( initializationMode ) kMeans.setRuns( numRuns ) kMeans.setEpsilon( numEpsilon ) We cached the training vector data to improve the performance and trained the KMeans object using the vector data to create a trained K-Means model: VectorData.cache val kMeansModel = kMeans.run( VectorData ) We have computed the K-Means cost and number of input data rows, and have output the results via println statements. The cost value indicates how tightly the clusters are packed and how separate the clusters are: val kMeansCost = kMeansModel.computeCost( VectorData ) println( "Input data rows : " + VectorData.count() ) println( "K-Means Cost : " + kMeansCost ) Next, we have used the K-Means Model to print the cluster centers as vectors for each of the three clusters that were computed: kMeansModel.clusterCenters.foreach{ println } Finally, we use the K-Means model predict function to create a list of cluster membership predictions. We then count these predictions by value to give a count of the data points in each cluster. This shows which clusters are bigger and whether there really are three clusters: val clusterRddInt = kMeansModel.predict( VectorData ) val clusterCount = clusterRddInt.countByValue clusterCount.toList.foreach{ println } } // end object kmeans1 So, in order to run this application, it must be compiled and packaged from the kmeans subdirectory as the Linux pwd command shows here: [hadoop@hc2nn kmeans]$ pwd /home/hadoop/spark/kmeans [hadoop@hc2nn kmeans]$ sbt package Loading /usr/share/sbt/bin/sbt-launch-lib.bash [info] Set current project to K-Means (in build file:/home/hadoop/spark/kmeans/) [info] Compiling 2 Scala sources to /home/hadoop/spark/kmeans/target/scala-2.10/classes... [info] Packaging /home/hadoop/spark/kmeans/target/scala-2.10/k- means_2.10-1.0.jar ... [info] Done packaging. [success] Total time: 20 s, completed Feb 19, 2015 5:02:07 PM Once this packaging is successful, we check HDFS to ensure that the test data is ready. As in the last example, we convert our data to numeric form using the convert.scala file, provided in the software package. We will process the DigitalBreathTestData2013- MALE2a.csv data file in the HDFS directory, /data/spark/kmeans, as follows: [hadoop@hc2nn nbayes]$ hdfs dfs -ls /data/spark/kmeans Found 3 items -rw-r--r-- 3 hadoop supergroup 24645166 2015-02-05 21:11 /data/spark/kmeans/DigitalBreathTestData2013-MALE2.csv -rw-r--r-- 3 hadoop supergroup 5694226 2015-02-05 21:48 /data/spark/kmeans/DigitalBreathTestData2013-MALE2a.csv drwxr-xr-x - hadoop supergroup 0 2015-02-05 21:46 /data/spark/kmeans/result The spark-submit tool is used to run the K-Means application. The only change in this command is that the class is now kmeans1: spark-submit --class kmeans1 --master spark://localhost:7077 --executor-memory 700M --total-executor-cores 100 /home/hadoop/spark/kmeans/target/scala-2.10/k-means_2.10-1.0.jar The output from the Spark cluster run is shown to be as follows: Input data rows : 467054 K-Means Cost : 5.40312223450789E7 The previous output shows the input data volume, which looks correct; it also shows the K- Means cost value. The cost is based on the Within Set Sum of Squared Errors (WSSSE) which basically gives a measure how well the found cluster centroids are matching the distribution of the data points. The better they are matching, the lower the cost. The following link https://datasciencelab.wordpress.com/2013/12/27/finding-the-k-in-k-means-clustering/ explains WSSSE and how to find a good value for k in more detail. Next come the three vectors, which describe the data cluster centers with the correct number of dimensions. Remember that these cluster centroid vectors will have the same number of columns as the original vector data: [0.24698249738061878,1.3015883142472253,0.005830116872250263,2.917374778855 5207,1.156645130895448,3.4400290524342454] [0.3321793984152627,1.784137241326256,0.007615970459266097,2.58319870759289 17,119.58366028156011,3.8379106085083468] [0.25247226760684494,1.702510963969387,0.006384899819416975,2.2314042480006 88,52.202897927594805,3.551509158139135] Finally, cluster membership is given for clusters 1 to 3 with cluster 1 (index 0) having the largest membership at 407539 member vectors: (0,407539) (1,12999) (2,46516) To summarize, we saw a practical example that shows how K-means algorithm is used to cluster data with the help of Apache Spark. If you found this post useful, do check out this book Mastering Apache Spark 2.x - Second Edition to learn about the latest enhancements in Apache Spark 2.x, such as interactive querying of live data and unifying DataFrames and Datasets.

0
1
16136

How-To Tutorials - Data

Learning the Salesforce Analytics Query Language (SAQL)

How to perform data partitioning in PostgreSQL 10

How to perform Audio-Video-Image Scraping with Python

4 common challenges in Web Scraping and how to handle them

Spam Filtering - Natural Language Processing Approach

Data Exploration using Spark SQL

Implementing matrix operations using SciPy and NumPy

How to set up a Deep Learning System on Amazon Web Services (AWS)

Implement Long-short Term Memory (LSTM) with TensorFlow

Logistic Regression Using TensorFlow

Trending Topics

How to implement Reinforcement Learning with TensorFlow

How to Compute Interpolation in SciPy

How to compute Discrete Fourier Transform (DFT) using SciPy

How to use MapReduce with Mongo shell

Implementing Apache Spark K-Means Clustering method on digital breath test data for road safety

Create a Free Account To Continue Reading

SignIn Free Account To Continue Reading