How to perform Audio-Video-Image Scraping with Python

Amarabha Banerjee
08 Mar 2018
9 min read
[box type="note" align="" class="" width=""]Our article is an excerpt from the book Web Scraping with Python, written by Richard Lawson. This book contains step by step tutorials on how to leverage Python programming techniques for ethical web scraping. [/box] A common practice in scraping is the download, storage, and further processing of media content (non-web pages or data files). This media can include images, audio, and video. To store the content locally (or in a service like S3) and to do it correctly, we need to know what is the type of media, and it isn’t enough to trust the file extension in the URL. Hence, we will learn how to download and correctly represent the media type based on information from the web server. Another common task is the generation of thumbnails of images, videos, or even a page of a website. We will examine several techniques of how to generate thumbnails and make website page screenshots. Many times these are used on a new website as thumbnail links to the scraped media which is stored locally. Finally, it is often the need to be able to transcode media, such as converting non-MP4 videos to MP4, or changing the bit-rate or resolution of a video. Another scenario is to extract only the audio from a video file. We won't look at video transcoding, but we will rip MP3 audio out of an MP4 file using ffmpeg. It's a simple step from there to also transcode video with ffmpeg. Downloading media content from the web Downloading media content from the web is a simple process: use Requests or another library and download it just like you would HTML content. Getting ready There is a class named URLUtility in the urls.py module in the util folder of the solution. This class handles several of the scenarios in this chapter with downloading and parsing URLs. We will be using this class in this recipe and a few others. Make sure the modules folder is in your Python path. Also, the example for this recipe is in the 04/01_download_image.py file. How to do it Here is how we proceed with the recipe: The URLUtility class can download content from a URL. The code in the recipe's file is the following: import const from util.urls import URLUtility util = URLUtility(const.ApodEclipseImage()) print(len(util.data)) When running this you will see the following output:  Reading URL: https://apod.nasa.gov/apod/image/1709/BT5643s.jpg Read 171014 bytes 171014 The example reads 171014 bytes of data. How it works The URL is defined as a constant const.ApodEclipseImage() in the const module: def ApodEclipseImage(): return "https://apod.nasa.gov/apod/image/1709/BT5643s.jpg" The constructor of the URLUtility class has the following implementation: def __init__(self, url, readNow=True): """ Construct the object, parse the URL, and download now if specified""" self._url = url self._response = None self._parsed = urlparse(url) if readNow: self.read() The constructor stores the URL, parses it, and downloads the file with the read() method. The following is the code of the read() method: def read(self): self._response = urllib.request.urlopen(self._url) self._data = self._response.read() This function uses urlopen to get a response object, and then reads the stream and stores it as a property of the object. That data can then be retrieved using the data property: @property def data(self): self.ensure_response() return self._data The code then simply reports on the length of that data, with the value of 171014. 
There's more

This class will be used for other tasks, such as determining content types, filenames, and extensions for those files. We will examine parsing of URLs for filenames next.

Parsing a URL with urllib to get the filename

When downloading content from a URL, we often want to save it in a file, and often it is good enough to name that file using the filename found in the URL. But the URL consists of a number of fragments, so how can we find the actual filename within the URL, especially when there are often many parameters after the file name?

Getting ready

We will again be using the URLUtility class for this task. The code file for the recipe is 04/02_parse_url.py.

How to do it

Execute the recipe's file with your Python interpreter. It will run the following code:

util = URLUtility(const.ApodEclipseImage())
print(util.filename_without_ext)

This results in the following output:

Reading URL: https://apod.nasa.gov/apod/image/1709/BT5643s.jpg
Read 171014 bytes
The filename is: BT5643s

How it works

In the constructor for URLUtility, there is a call to urllib.parse.urlparse. The following demonstrates using the function interactively:

>>> parsed = urlparse(const.ApodEclipseImage())
>>> parsed
ParseResult(scheme='https', netloc='apod.nasa.gov', path='/apod/image/1709/BT5643s.jpg', params='', query='', fragment='')

The ParseResult object contains the various components of the URL. The path element contains the path and the filename. The call to the .filename_without_ext property returns just the filename without the extension:

@property
def filename_without_ext(self):
    filename = os.path.splitext(os.path.basename(self._parsed.path))[0]
    return filename

The call to os.path.basename returns only the filename portion of the path (including the extension). os.path.splitext() then separates the filename and the extension, and the property returns the first element of that tuple (the filename).

There's more

It may seem odd that this does not also return the extension as part of the filename. This is because we cannot assume that the content we received actually matches the type implied by the extension. It is more accurate to determine this using headers returned by the web server. That's our next recipe.

Determining the type of content for a URL

When performing a GET request for content from a web server, the web server will return a number of headers, one of which identifies the type of the content from the perspective of the web server. In this recipe we learn to use that header to determine what the web server considers the type of the content to be.

Getting ready

We again use the URLUtility class. The code for the recipe is in 04/03_determine_content_type_from_response.py.

How to do it

We proceed as follows. Execute the script for the recipe. It contains the following code:

util = URLUtility(const.ApodEclipseImage())
print("The content type is: " + util.contenttype)

With the following result:

Reading URL: https://apod.nasa.gov/apod/image/1709/BT5643s.jpg
Read 171014 bytes
The content type is: image/jpeg

How it works

The .contenttype property is implemented as follows:

@property
def contenttype(self):
    self.ensure_response()
    return self._response.headers['content-type']

The .headers property of the _response object is a dictionary-like class of headers. The content-type key will retrieve the content-type specified by the server. The call to the ensure_response() method simply ensures that the .read() function has been executed.

There's more

The headers in a response contain a wealth of information.
If we look more closely at the headers property of the response, we can see the following headers are returned:

>>> response = urllib.request.urlopen(const.ApodEclipseImage())
>>> for header in response.headers: print(header)
Date
Server
Last-Modified
ETag
Accept-Ranges
Content-Length
Connection
Content-Type
Strict-Transport-Security

And we can see the values for each of these headers:

>>> for header in response.headers: print(header + " ==> " + response.headers[header])
Date ==> Tue, 26 Sep 2017 19:31:41 GMT
Server ==> WebServer/1.0
Last-Modified ==> Thu, 31 Aug 2017 20:26:32 GMT
ETag ==> "547bb44-29c06-5581275ce2b86"
Accept-Ranges ==> bytes
Content-Length ==> 171014
Connection ==> close
Content-Type ==> image/jpeg
Strict-Transport-Security ==> max-age=31536000; includeSubDomains

Many of these we will not examine in this book, but for the unfamiliar it is good to know that they exist.

Determining the file extension from a content type

It is good practice to use the content-type header to determine the type of content, and to determine the extension to use when storing the content as a file.

Getting ready

We again use the URLUtility object that we created. The recipe's script is 04/04_determine_file_extension_from_contenttype.py.

How to do it

Proceed by running the recipe's script. An extension for the media type can be found using the .extension_from_contenttype property:

util = URLUtility(const.ApodEclipseImage())
print("Filename from content-type: " + util.extension_from_contenttype)
print("Filename from url: " + util.extension_from_url)

This results in the following output:

Reading URL: https://apod.nasa.gov/apod/image/1709/BT5643s.jpg
Read 171014 bytes
Filename from content-type: .jpg
Filename from url: .jpg

This reports both the extension determined from the content type and the extension determined from the URL. These can be different, but in this case they are the same.

How it works

The following is the implementation of the .extension_from_contenttype property:

@property
def extension_from_contenttype(self):
    self.ensure_response()
    map = const.ContentTypeToExtensions()
    if self.contenttype in map:
        return map[self.contenttype]
    return None

The first line ensures that we have read the response from the URL. The function then uses a Python dictionary, defined in the const module, which maps content types to extensions:

def ContentTypeToExtensions():
    return {
        "image/jpeg": ".jpg",
        "image/jpg": ".jpg",
        "image/png": ".png"
    }

If the content type is in the dictionary, then the corresponding value will be returned. Otherwise, None is returned. Note the corresponding property, .extension_from_url:

@property
def extension_from_url(self):
    ext = os.path.splitext(os.path.basename(self._parsed.path))[1]
    return ext

This uses the same technique as the .filename_without_ext property to parse the URL, but instead returns the [1] element, which represents the extension rather than the base filename.

To summarize, we discussed how effectively we can scrape audio, video, and image content from the web using Python. If you liked our post, be sure to check out Web Scraping with Python, which gives more information on performing web scraping efficiently with Python.
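The chapter introduction above also promises ripping MP3 audio out of an MP4 with ffmpeg; that recipe is not part of this excerpt, but a minimal sketch of the idea, assuming ffmpeg is installed on the PATH and a local file named video.mp4 exists, looks like this:

import subprocess

# Extract the audio track from an MP4 and encode it as MP3.
# -vn drops the video stream; -acodec libmp3lame selects the MP3 encoder.
subprocess.run(
    ["ffmpeg", "-i", "video.mp4", "-vn", "-acodec", "libmp3lame", "audio.mp3"],
    check=True)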
4 common challenges in Web Scraping and how to handle them

Amarabha Banerjee
08 Mar 2018
13 min read
[box type="note" align="" class="" width=""]Our article is an excerpt from the book Web Scraping with Python, written by Richard Lawson. This book contains step by step tutorials on how to leverage Python programming techniques for ethical web scraping. [/box] In this article, we will explore primary challenges of Web Scraping and how to get away with it easily. Developing a reliable scraper is never easy, there are so many what ifs that we need to take into account. What if the website goes down? What if the response returns unexpected data? What if your IP is throttled or blocked? What if authentication is required? While we can never predict and cover all what ifs, we will discuss some common traps, challenges, and workarounds. Note that several of the recipes require access to a website that I have provided as a Docker container. They require more logic than the simple, static site we used in earlier chapters. Therefore, you will need to pull and run a Docker container using the following Docker commands: docker pull mheydt/pywebscrapecookbook docker run -p 5001:5001 pywebscrapecookbook Retrying failed page downloads Failed page requests can be easily handled by Scrapy using retry middleware. When installed, Scrapy will attempt retries when receiving the following HTTP error codes: [500, 502, 503, 504, 408] The process can be further configured using the following parameters: RETRY_ENABLED (True/False - default is True) RETRY_TIMES (# of times to retry on any errors - default is 2) RETRY_HTTP_CODES (a list of HTTP error codes which should be retried - default is [500, 502, 503, 504, 408]) How to do it The 06/01_scrapy_retry.py script demonstrates how to configure Scrapy for retries. The script file contains the following configuration for Scrapy: process = CrawlerProcess({ 'LOG_LEVEL': 'DEBUG', 'DOWNLOADER_MIDDLEWARES': { "scrapy.downloadermiddlewares.retry.RetryMiddleware": 500 }, 'RETRY_ENABLED': True, 'RETRY_TIMES': 3 }) process.crawl(Spider) process.start() How it works Scrapy will pick up the configuration for retries as specified when the spider is run. When encountering errors, Scrapy will retry up to three times before giving up. Supporting page redirects Page redirects in Scrapy are handled using redirect middleware, which is enabled by default. The process can be further configured using the following parameters: REDIRECT_ENABLED: (True/False - default is True) REDIRECT_MAX_TIMES: (The maximum number of redirections to follow for any single request - default is 20) How to do it The script in 06/02_scrapy_redirects.py demonstrates how to configure Scrapy to handle redirects. This configures a maximum of two redirects for any page. Running the script reads the NASA sitemap and crawls that content. This contains a large number of redirects, many of which are redirects from HTTP to HTTPS versions of URLs. There will be a lot of output, but here are a few lines demonstrating the output: Parsing: <200 https://www.nasa.gov/content/earth-expeditions-above/> ['http://www.nasa.gov/content/earth-expeditions-above', 'https://www.nasa.gov/content/earth-expeditions-above'] This particular URL was processed after one redirection, from an HTTP to an HTTPS version of the URL. The list defines all of the URLs that were involved in the redirection. You will also be able to see where redirection exceeded the specified level (2) in the output pages. 
The following is one example:

2017-10-22 17:55:00 [scrapy.downloadermiddlewares.redirect] DEBUG: Discarding <GET http://www.nasa.gov/topics/journeytomars/news/index.html>: max redirections reached

How it works

The spider is defined as follows:

class Spider(scrapy.spiders.SitemapSpider):
    name = 'spider'
    sitemap_urls = ['https://www.nasa.gov/sitemap.xml']

    def parse(self, response):
        print("Parsing: ", response)
        print(response.request.meta.get('redirect_urls'))

This is identical to our previous NASA sitemap-based crawler, with the addition of one line printing the redirect_urls. In any call to parse, this metadata will contain all the redirects that occurred to get to this page. The crawling process is configured with the following code:

process = CrawlerProcess({
    'LOG_LEVEL': 'DEBUG',
    'DOWNLOADER_MIDDLEWARES': {
        "scrapy.downloadermiddlewares.redirect.RedirectMiddleware": 500
    },
    'REDIRECT_ENABLED': True,
    'REDIRECT_MAX_TIMES': 2
})

Redirects are enabled by default, but this sets the maximum number of redirects to 2 instead of the default of 20.

Waiting for content to be available in Selenium

A common problem with dynamic web pages is that even after the whole page has loaded, and hence the get() method in Selenium has returned, there may still be content that we need to access later, as there are outstanding Ajax requests from the page that are still pending completion. An example of this is needing to click a button, but the button not being enabled until all data has been loaded asynchronously to the page after loading.

Take the following page as an example: http://the-internet.herokuapp.com/dynamic_loading/2. This page finishes loading very quickly and presents us with a Start button. When pressing the button, we are presented with a progress bar for five seconds, and when this is completed, we are presented with Hello World! Now suppose we want to scrape this page to get the content that is exposed only after the button is pressed and after the wait. How do we do this?

How to do it

We can do this using Selenium. We will use two features of Selenium. The first is the ability to click on page elements. The second is the ability to wait until an element with a specific ID is available on the page. First, we get the button and click it. The button's HTML is the following:

<div id='start'>
    <button>Start</button>
</div>

When the button is pressed and the load completes, the following HTML is added to the document:

<div id='finish'>
    <h4>Hello World!</h4>
</div>

We will use the Selenium driver to find the Start button, click it, and then wait until a div with an ID of 'finish' is available. Then we get that element and return the text in the enclosed <h4> tag. You can try this by running 06/03_press_and_wait.py. Its output will be the following:

clicked
Hello World!

Now let's see how it worked.

How it works

Let us break down the explanation. We start by importing the required items from Selenium:

from selenium import webdriver
from selenium.webdriver.support import ui

Now we load the driver and the page:

driver = webdriver.PhantomJS()
driver.get("http://the-internet.herokuapp.com/dynamic_loading/2")

With the page loaded, we can retrieve the button:

button = driver.find_element_by_xpath("//*/div[@id='start']/button")

And then we can click the button:

button.click()
print("clicked")

Next we create a WebDriverWait object:

wait = ui.WebDriverWait(driver, 10)

With this object, we can request that Selenium wait for certain UI events. This also sets a maximum wait of 10 seconds.
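As an aside, Selenium also ships ready-made expected conditions that can replace the hand-written lambda used next; a rough sketch of the same wait written that way, reusing the driver created above and assuming a reasonably recent Selenium release, would be:

from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

# Wait up to 10 seconds for the div with id="finish" to appear in the DOM,
# then read the text of its enclosed <h4> element.
wait = WebDriverWait(driver, 10)
finish = wait.until(EC.presence_of_element_located((By.ID, "finish")))
print(finish.find_element(By.TAG_NAME, "h4").text)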
Now using this, we can wait until we meet a criterion: that an element is identifiable using the following XPath:

wait.until(lambda driver: driver.find_element_by_xpath("//*/div[@id='finish']"))

When this completes, we can retrieve the h4 element and get its enclosed text:

finish_element = driver.find_element_by_xpath("//*/div[@id='finish']/h4")
print(finish_element.text)

Limiting crawling to a single domain

We can inform Scrapy to limit the crawl to only pages within a specified set of domains. This is an important task, as links can point to anywhere on the web, and we often want to control where crawls end up going. Scrapy makes this very easy to do. All that needs to be done is to set the allowed_domains field of your scraper class.

How to do it

The code for this example is 06/04_allowed_domains.py. You can run the script with your Python interpreter. It will execute and generate a ton of output, but if you keep an eye on it, you will see that it only processes pages on nasa.gov.

How it works

The code is the same as the previous NASA site crawlers except that we include allowed_domains=['nasa.gov']:

class Spider(scrapy.spiders.SitemapSpider):
    name = 'spider'
    sitemap_urls = ['https://www.nasa.gov/sitemap.xml']
    allowed_domains = ['nasa.gov']

    def parse(self, response):
        print("Parsing: ", response)

The NASA site is fairly consistent about staying within its root domain, but there are occasional links to other sites, such as content on boeing.com. This code will prevent moving to those external sites.

Processing infinitely scrolling pages

Many websites have replaced "previous/next" pagination buttons with an infinite scrolling mechanism. These websites use this technique to load more data when the user has reached the bottom of the page. Because of this, strategies for crawling by following the "next page" link fall apart. While this would seem to be a case for using browser automation to simulate the scrolling, it's actually quite easy to figure out the web page's Ajax requests and use those for crawling instead of the actual page. Let's look at spidyquotes.herokuapp.com/scroll as an example.

Getting ready

Open http://spidyquotes.herokuapp.com/scroll in your browser. This page will load additional content when you scroll to the bottom of the page:

Screenshot of the quotes to scrape

Once the page is open, go into your developer tools and select the network panel. Then, scroll to the bottom of the page. You will see new content in the network panel. When we click on one of the links, we can see the following JSON:

{
    "has_next": true,
    "page": 2,
    "quotes": [{
        "author": {
            "goodreads_link": "/author/show/82952.Marilyn_Monroe",
            "name": "Marilyn Monroe",
            "slug": "Marilyn-Monroe"
        },
        "tags": ["friends", "heartbreak", "inspirational", "life", "love", "sisters"],
        "text": "\u201cThis life is what you make it...."
    }, {
        "author": {
            "goodreads_link": "/author/show/1077326.J_K_Rowling",
            "name": "J.K. Rowling",
            "slug": "J-K-Rowling"
        },
        "tags": ["courage", "friends"],
        "text": "\u201cIt takes a great deal of bravery to stand up to our enemies, but just as much to stand up to our friends.\u201d"
    },

This is great, because all we need to do is continually generate requests to /api/quotes?page=x, increasing x until the reply document no longer reports a next page via the has_next flag. If there are no more pages, then this flag will not be set in the document.

How to do it

The 06/05_scrapy_continuous.py file contains a Scrapy agent, which crawls this set of pages.
Run it with your Python interpreter and you will see output similar to the following (multiple excerpts from the output):

<200 http://spidyquotes.herokuapp.com/api/quotes?page=2>
2017-10-29 16:17:37 [scrapy.core.scraper] DEBUG: Scraped from <200 http://spidyquotes.herokuapp.com/api/quotes?page=2>
{'text': "“This life is what you make it. No matter what, you're going to mess up sometimes, it's a universal truth. But the good part is you get to decide how you're going to mess it up. Girls will be your friends - they'll act like it anyway. But just remember, some come, some go. The ones that stay with you through everything - they're your true best friends. Don't let go of them. Also remember, sisters make the best friends in the world. As for lovers, well, they'll come and go too. And baby, I hate to say it, most of them - actually pretty much all of them are going to break your heart, but you can't give up because if you give up, you'll never find your soulmate. You'll never find that half who makes you whole and that goes for everything. Just because you fail once, doesn't mean you're gonna fail at everything. Keep trying, hold on, and always, always, always believe in yourself, because if you don't, then who will, sweetie? So keep your head high, keep your chin up, and most importantly, keep smiling, because life's a beautiful thing and there's so much to smile about.”", 'author': 'Marilyn Monroe', 'tags': ['friends', 'heartbreak', 'inspirational', 'life', 'love', 'Sisters']}
2017-10-29 16:17:37 [scrapy.core.scraper] DEBUG: Scraped from <200 http://spidyquotes.herokuapp.com/api/quotes?page=2>
{'text': '“It takes a great deal of bravery to stand up to our enemies, but just as much to stand up to our friends.”', 'author': 'J.K. Rowling', 'tags': ['courage', 'friends']}
2017-10-29 16:17:37 [scrapy.core.scraper] DEBUG: Scraped from <200 http://spidyquotes.herokuapp.com/api/quotes?page=2>
{'text': "“If you can't explain it to a six year old, you don't understand it yourself.”", 'author': 'Albert Einstein', 'tags': ['simplicity', 'Understand']}

When this gets to page 10 it will stop, as it will see that there is no next-page flag set in the content.

How it works

Let's walk through the spider to see how this works. The spider starts with the following definition of the start URL:

class Spider(scrapy.Spider):
    name = 'spidyquotes'
    quotes_base_url = 'http://spidyquotes.herokuapp.com/api/quotes'
    start_urls = [quotes_base_url]
    download_delay = 1.5

The parse method then prints the response and also parses the JSON into the data variable:

def parse(self, response):
    print(response)
    data = json.loads(response.body)

Then it loops through all the items in the quotes element of the JSON object. For each item, it yields a new Scrapy item back to the Scrapy engine:

for item in data.get('quotes', []):
    yield {
        'text': item.get('text'),
        'author': item.get('author', {}).get('name'),
        'tags': item.get('tags'),
    }

It then checks whether the data JSON variable has a 'has_next' property, and if so it gets the next page number and yields a new request back to Scrapy to parse that page:

if data['has_next']:
    next_page = data['page'] + 1
    yield scrapy.Request(self.quotes_base_url + "?page=%s" % next_page)

There's more...

It is also possible to process infinitely scrolling pages using Selenium.
The following code is in 06/06_scrape_continuous_twitter.py:

from selenium import webdriver
import time

driver = webdriver.PhantomJS()

print("Starting")
driver.get("https://twitter.com")

scroll_pause_time = 1.5

# Get scroll height
last_height = driver.execute_script("return document.body.scrollHeight")
while True:
    print(last_height)
    # Scroll down to bottom
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")

    # Wait to load page
    time.sleep(scroll_pause_time)

    # Calculate new scroll height and compare with last scroll height
    new_height = driver.execute_script("return document.body.scrollHeight")
    print(new_height, last_height)

    if new_height == last_height:
        break
    last_height = new_height

The output would be similar to the following:

Starting
4882
8139 4882
8139
11630 8139
11630
15055 11630
15055
15055 15055

Process finished with exit code 0

This code starts by loading the page from Twitter. The call to .get() will return when the page is fully loaded. The scrollHeight is then retrieved, and the program scrolls to that height and waits a moment for the new content to load. The scrollHeight of the browser is retrieved again, and if it differs from last_height, the loop continues processing. If it is the same as last_height, no new content has loaded, and you can then continue on and retrieve the HTML for the completed page.

We have discussed the common challenges faced when performing web scraping with Python and their workarounds. If you liked this post, be sure to check out Web Scraping with Python, which consists of useful recipes to work with Python and perform efficient web scraping.
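There is also a lighter-weight way to handle the spidyquotes example from the previous recipe: because the infinite scroll is backed by a plain JSON API, the pages can be fetched directly with requests, following the page and has_next fields exactly as described above. A rough sketch, assuming the demo site is still online:

import requests

base_url = "http://spidyquotes.herokuapp.com/api/quotes"
page = 1
while True:
    # Request one page of quotes from the JSON API.
    data = requests.get(base_url, params={"page": page}).json()
    for item in data.get("quotes", []):
        print(item["author"]["name"], "-", item["text"][:40])
    # Stop when the API reports that there is no next page.
    if not data.get("has_next"):
        break
    page = data["page"] + 1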
Learning Dependency Injection (DI)

Packt
08 Mar 2018
15 min read
In this article by Sherwin John Calleja Tragura, author of the book Spring 5.0 Cookbook, we will learn about the implementation of the Spring container using XML and JavaConfig, and also about managing beans in an XML-based container. In this article, you will learn how to implement the Spring container using XML, implement the Spring container using JavaConfig, and manage the beans in an XML-based container.

Implementing the Spring container using XML

Let us begin with the creation of the Spring Web Project using the Maven plugin of our STS Eclipse 8.3. This web project will implement our first Spring 5.0 container using the XML-based technique. This is the most conventional but robust way of creating the Spring container. The container is where the objects are created, managed, wired together with their dependencies, and monitored from their initialization up to their destruction. This recipe will mainly highlight how to create an XML-based Spring container.

Getting ready

Create a Maven project ready for development using STS Eclipse 8.3. Be sure to have installed the correct JRE. Let us name the project ch02-xml.

How to do it…

After creating the project, certain Maven errors will be encountered. Fix the Maven issues of our ch02-xml project in order to use the XML-based Spring 5.0 container by performing the following steps:

Open pom.xml of the project and add the following properties, which contain the Spring build version and Servlet container to utilize:

<properties>
    <spring.version>5.0.0.BUILD-SNAPSHOT</spring.version>
    <servlet.api.version>3.1.0</servlet.api.version>
</properties>

Add the following Spring 5 dependencies inside pom.xml. These dependencies are essential in providing us with the interfaces and classes to build our Spring container:

<dependencies>
    <dependency>
        <groupId>org.springframework</groupId>
        <artifactId>spring-context</artifactId>
        <version>${spring.version}</version>
    </dependency>
    <dependency>
        <groupId>org.springframework</groupId>
        <artifactId>spring-core</artifactId>
        <version>${spring.version}</version>
    </dependency>
    <dependency>
        <groupId>org.springframework</groupId>
        <artifactId>spring-beans</artifactId>
        <version>${spring.version}</version>
    </dependency>
</dependencies>

It is required to add the following repositories, from which the Spring 5.0 dependencies in Step 2 will be downloaded:

<repositories>
    <repository>
        <id>spring-snapshots</id>
        <name>Spring Snapshots</name>
        <url>https://repo.spring.io/libs-snapshot</url>
        <snapshots>
            <enabled>true</enabled>
        </snapshots>
    </repository>
</repositories>

Then add the Maven plugin for deployment, but be sure to recognize web.xml as the deployment descriptor. This can be done by enabling <failOnMissingWebXml> or just deleting the <configuration> tag as follows:

<plugin>
    <artifactId>maven-war-plugin</artifactId>
    <version>2.3</version>
</plugin>

Follow the Tomcat Maven plugin for deployment, as explained in Chapter 1. After the Maven configuration details, check if there is a WEB-INF folder inside src/main/webapp. If there is none, create one. This is mandatory for this project since we will be using a deployment descriptor (web.xml). Inside the WEB-INF folder, create a deployment descriptor or drop a web.xml template into the src/main/webapp/WEB-INF directory. Then, create an XML-based Spring container named ch02-beans.xml inside the ch02-xml/src/main/java/ directory.
The configuration file must contain the following namespaces and tags:

<?xml version="1.0" encoding="UTF-8"?>
<beans xmlns="http://www.springframework.org/schema/beans"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xmlns:context="http://www.springframework.org/schema/context"
    xsi:schemaLocation="http://www.springframework.org/schema/beans
        http://www.springframework.org/schema/beans/spring-beans.xsd
        http://www.springframework.org/schema/context
        http://www.springframework.org/schema/context/spring-context.xsd">
</beans>

You can generate this file using the STS Eclipse wizard (Ctrl-N), under the Spring module's Spring Bean Configuration File option. Save all the files, then clean and build the Maven project. Do not deploy yet, because this is just a standalone project at the moment.

How it works…

This project just imported three major Spring 5.0 libraries, namely Spring-Core, Spring-Beans, and Spring-Context, because the major classes and interfaces for creating the container are found in these libraries. This shows that Spring, unlike other frameworks, does not need an entire load of libraries just to set up the initial platform. Spring can be perceived as a huge enterprise framework nowadays, but internally it is still lightweight.

The basic container that manages objects in Spring is provided by the org.springframework.beans.factory.BeanFactory interface and can only be found in the Spring-Beans module. Once additional features are needed, such as message resource handling, AOP capabilities, application-specific contexts, and listener implementation, the sub-interface of BeanFactory, namely the org.springframework.context.ApplicationContext interface, is used instead. This ApplicationContext, found in the Spring-Context module, is the one that provides an enterprise-specific container for all its applications, because it encompasses a larger scope of Spring components than the BeanFactory interface.

The container created, ch02-beans.xml, an ApplicationContext, is an XML-based configuration that contains XSD schemas from the three main libraries imported. These schemas have tag libraries and bean properties which are essential in managing the whole framework. But beware of runtime errors once libraries are removed from the dependencies, because using these tags is equivalent to using the libraries per se. The final Spring Maven project directory structure must look like this:

Implementing the Spring container using JavaConfig

Another option for implementing the Spring 5.0 container is through the use of Spring JavaConfig. This is a technique that uses pure Java classes to configure the framework's container. This technique eliminates the use of bulky and tedious XML metadata and also provides a type-safe and refactoring-free approach to configuring entities or collections of objects in the container. This recipe will showcase how to create the container using JavaConfig in a web.xml-less approach.

Getting ready

Create another Maven project and name the project ch02-jc. This STS Eclipse project will be using a Java class approach, including for its deployment descriptor.

How to do it…

To get rid of the usual Maven bugs, immediately open the pom.xml of ch02-jc and add <properties>, <dependencies>, and <repositories> equivalent to what was added in the Implementing the Spring container using XML recipe. Next, get rid of web.xml. Since the Servlet 3.0 specification was implemented, servlet containers can now support projects without web.xml. This is done by implementing the handler interface called org.springframework.web.WebApplicationInitializer to programmatically configure the ServletContext.
Create a SpringWebinitializer class and override its onStartup() method, without any implementation yet:

public class SpringWebinitializer implements WebApplicationInitializer {

    @Override
    public void onStartup(ServletContext container) throws ServletException {
    }
}

The lines in Step 2 will generate some runtime errors until you add the following Maven dependency:

<dependency>
    <groupId>org.springframework</groupId>
    <artifactId>spring-web</artifactId>
    <version>${spring.version}</version>
</dependency>

In pom.xml, disable <failOnMissingWebXml>. After the Maven details, create a class named BeanConfig, the ApplicationContext definition, bearing the annotation @Configuration at the top of it. The class must be inside the org.packt.starter.ioc.context package and must be an empty class at the moment:

@Configuration
public class BeanConfig {

}

Save all the files and clean and build the Maven project.

How it works…

The Maven project ch02-jc makes use of both JavaConfig and ServletContainerInitializer, meaning there will be no XML configuration from the servlet to the Spring 5.0 containers. The BeanConfig class is the ApplicationContext of the project; it carries the annotation @Configuration, indicating that the class is used by JavaConfig as a source of bean definitions. This is handier than creating an XML-based configuration with lots of metadata.

On the other hand, ch02-jc implemented org.springframework.web.WebApplicationInitializer, which is handled by org.springframework.web.SpringServletContainerInitializer, the framework's implementation of the servlet's ServletContainerInitializer. SpringServletContainerInitializer is notified by the servlet container at startup and in turn invokes onStartup(ServletContext) on each WebApplicationInitializer, which takes care of the programmatic registration of the filters, servlets, and listeners provided by the ServletContext. Eventually, the servlet container will acknowledge the status reported by SpringServletContainerInitializer, thus eliminating the use of web.xml.

On Maven's side, the plugin for deployment must be notified that the project will not use web.xml. This is done by setting <failOnMissingWebXml> to false inside its <configuration> tag. The final Spring Web Project directory structure must look like the following structure:

Managing the beans in an XML-based container

Frameworks become popular because of the principles behind the architecture they are built from. Each framework is built from different design patterns that manage the creation and behavior of the objects they manage. This recipe will detail how Spring 5.0 manages the objects of its applications and how it shares a set of methods and functions across the platform.

Getting ready

The two Maven projects previously created will be utilized to illustrate how Spring 5.0 loads objects into heap memory. We will also be utilizing the ApplicationContext rather than the BeanFactory container, in preparation for the next recipes involving more Spring components.

How to do it…

With our ch02-xml, let us demonstrate how Spring loads objects using the XML-based ApplicationContext container. Create a package layer, org.packt.starter.ioc.model, for our model classes. Our model classes will be typical Plain Old Java Objects (POJOs), which the Spring 5.0 architecture is known for.
Inside the newly created package, create the classes Employee and Department, which contain the following blueprints:

public class Employee {

    private String firstName;
    private String lastName;
    private Date birthdate;
    private Integer age;
    private Double salary;
    private String position;
    private Department dept;

    public Employee(){
        System.out.println(" an employee is created.");
    }

    public Employee(String firstName, String lastName, Date birthdate, Integer age, Double salary, String position, Department dept) {
        this.firstName = firstName;
        this.lastName = lastName;
        this.birthdate = birthdate;
        this.age = age;
        this.salary = salary;
        this.position = position;
        this.dept = dept;
        System.out.println(" an employee is created.");
    }

    // getters and setters
}

public class Department {

    private Integer deptNo;
    private String deptName;

    public Department() {
        System.out.println("a department is created.");
    }

    // getters and setters
}

Afterwards, open the ApplicationContext ch02-beans.xml. Register our first set of Employee and Department objects using the <bean> tag as follows:

<bean id="empRec1" class="org.packt.starter.ioc.model.Employee" />
<bean id="dept1" class="org.packt.starter.ioc.model.Department" />

The beans in Step 3 contain private instance variables that have zero and null default values. To update them, our classes have mutators or setter methods that can be used to avoid the NullPointerException that always happens when we immediately use empty objects. In Spring, calling these setters is tantamount to injecting data into the <bean>, similar to how the following objects are created:

<bean id="empRec2" class="org.packt.starter.ioc.model.Employee">
    <property name="firstName"><value>Juan</value></property>
    <property name="lastName"><value>Luna</value></property>
    <property name="age"><value>70</value></property>
    <property name="birthdate"><value>October 28, 1945</value></property>
    <property name="position"><value>historian</value></property>
    <property name="salary"><value>150000</value></property>
    <property name="dept"><ref bean="dept2"/></property>
</bean>

<bean id="dept2" class="org.packt.starter.ioc.model.Department">
    <property name="deptNo"><value>13456</value></property>
    <property name="deptName"><value>History Department</value></property>
</bean>

A <property> tag is equivalent to a setter definition accepting an actual value or an object reference. The name attribute defines the name of the setter minus the set prefix, converted to camel-case notation. The value attribute or the <value> tag both pertain to supported Spring-type values (for example, int, double, float, Boolean, String). The ref attribute or <ref> provides a reference to another loaded <bean> in the container. Another way of writing the bean object empRec2 is through the use of ref and value attributes, such as the following:

<bean id="empRec3" class="org.packt.starter.ioc.model.Employee">
    <property name="firstName" value="Jose"/>
    <property name="lastName" value="Rizal"/>
    <property name="age" value="101"/>
    <property name="birthdate" value="June 19, 1950"/>
    <property name="position" value="scriber"/>
    <property name="salary" value="90000"/>
    <property name="dept" ref="dept3"/>
</bean>

<bean id="dept3" class="org.packt.starter.ioc.model.Department">
    <property name="deptNo" value="56748"/>
    <property name="deptName" value="Communication Department" />
</bean>

Another way of updating the private instance variables of the model objects is to make use of the constructors.
Actual Spring data and object references can be inserted through the constructor arguments in the metadata:

<bean id="empRec5" class="org.packt.starter.ioc.model.Employee">
    <constructor-arg><value>Poly</value></constructor-arg>
    <constructor-arg><value>Mabini</value></constructor-arg>
    <constructor-arg><value>August 10, 1948</value></constructor-arg>
    <constructor-arg><value>67</value></constructor-arg>
    <constructor-arg><value>45000</value></constructor-arg>
    <constructor-arg><value>Linguist</value></constructor-arg>
    <constructor-arg><ref bean="dept3"></ref></constructor-arg>
</bean>

After all the modifications, save ch02-beans.xml. Create a TestBeans class inside the src/test/java directory. This class will load the XML configuration resource into the ApplicationContext container through org.springframework.context.support.ClassPathXmlApplicationContext and fetch all the objects created through its getBean() method.

public class TestBeans {

    public static void main(String args[]){
        ApplicationContext context = new ClassPathXmlApplicationContext("ch02-beans.xml");

        System.out.println("application context loaded.");

        System.out.println("****The empRec1 bean****");
        Employee empRec1 = (Employee) context.getBean("empRec1");

        System.out.println("****The empRec2*****");
        Employee empRec2 = (Employee) context.getBean("empRec2");
        Department dept2 = empRec2.getDept();
        System.out.println("First Name: " + empRec2.getFirstName());
        System.out.println("Last Name: " + empRec2.getLastName());
        System.out.println("Birthdate: " + empRec2.getBirthdate());
        System.out.println("Salary: " + empRec2.getSalary());
        System.out.println("Dept. Name: " + dept2.getDeptName());

        System.out.println("****The empRec5 bean****");
        Employee empRec5 = context.getBean("empRec5", Employee.class);
        Department dept3 = empRec5.getDept();
        System.out.println("First Name: " + empRec5.getFirstName());
        System.out.println("Last Name: " + empRec5.getLastName());
        System.out.println("Dept. Name: " + dept3.getDeptName());
    }
}

The expected output after running the main() thread will be:

an employee is created.
an employee is created.
a department is created.
an employee is created.
a department is created.
an employee is created.
a department is created.
application context loaded.
*********The empRec1 bean ***************
*********The empRec2 bean ***************
First Name: Juan
Last Name: Luna
Birthdate: Sun Oct 28 00:00:00 CST 1945
Salary: 150000.0
Dept. Name: History Department
*********The empRec5 bean ***************
First Name: Poly
Last Name: Mabini
Dept. Name: Communication Department

How it works…

The principle behind creating <bean> objects in the container is called the Inversion of Control design pattern. In order to use the objects, their dependencies, and also their behavior, these must be placed within the framework per se. After registering them in the container, Spring will just take care of their instantiation and their availability to other objects. Developers can just "fetch" them if they want to include them in their software modules, as shown in the following diagram:

The IoC design pattern can be seen as synonymous with the Hollywood Principle ("Don't call us, we'll call you!"), which is a popular line in most object-oriented programming languages. The framework does not care whether the developer needs the objects or not, because the lifespan of the objects lies with the framework's rules.
In the case of setting new values or updating the values of an object's private variables, IoC has an implementation that can be used for "injecting" new actual values or object references into the bean, and it is popularly known as the Dependency Injection (DI) design pattern. This principle exposes all the properties of the bean to the public through its setter methods or its constructors. Injecting Spring values and object references into the setter methods using the <property> tag, without knowing their implementation, is called the Method Injection type of DI. On the other hand, if we create the bean with initialized values injected into its constructor through <constructor-arg>, it is known as Constructor Injection.

To create the ApplicationContext container, we need to instantiate ClassPathXmlApplicationContext or FileSystemXmlApplicationContext, depending on the location of the XML definition file. Since the file is found in ch02-xml/src/main/java/, the ClassPathXmlApplicationContext implementation is the best option. This proves that the ApplicationContext is an object too, bearing all that XML metadata. It has several overloaded getBean() methods used to fetch all the objects loaded with it.

Summary

In this article we went over how to create an XML-based Spring container, how to create the container using JavaConfig in a web.xml-less approach, and how Spring 5.0 manages the objects of its applications and shares a set of methods and functions across the platform.
Spam Filtering - Natural Language Processing Approach

Packt
08 Mar 2018
16 min read
In this article by Jalaj Thanaki, the author of the book Python Natural Language Processing, we discuss how to develop a natural language processing (NLP) application. In this article, we will be developing a spam filter. In order to develop the spam filter we will use a supervised machine learning (ML) algorithm named logistic regression. You could also use a decision tree, Naive Bayes, or a support vector machine (SVM). To make this happen, the following steps will be covered: understanding the logistic regression ML algorithm, data collection and exploration, and splitting the dataset into a training dataset and a testing dataset.

Understanding the logistic regression ML algorithm

Let's understand the logistic regression algorithm first. For this classification algorithm, I will give you an intuition of how logistic regression works, and we will see some basic mathematics related to it. Then we will see the spam filtering application.

First, we consider binary classes like spam or not-spam, good or bad, win or lose, 0 or 1, and so on, for understanding the algorithm and its application. Suppose I want to classify emails into the spam and non-spam (ham) categories, so spam and non-spam are the discrete output labels, or target concepts, here. Our goal is to predict whether a new email is spam or not-spam. Not-spam is also known as ham. In order to build this NLP application we are going to use logistic regression.

Let's step back a while and understand the technicalities of the algorithm first. Here I'm stating the facts and mathematics related to this algorithm in a very simple manner, so everyone can understand the logic. The general approach for understanding this algorithm is as follows; if you know some ML then you can connect the dots, and if you are new to ML then don't worry, because we are going to understand every part. We define our hypothesis function, which helps us generate our target output or target concept. We define the cost function or error function, and we choose the error function in such a way that we can derive its partial derivative easily, so we can calculate gradient descent easily. Over time, we try to minimize the error so we can generate more accurate labels and classify the data accurately.

In statistics, logistic regression is also called logit regression or the logit model. This algorithm is mostly used as a binary classifier, which means there should be two different classes into which you want to classify the data. The binary logistic model is used to estimate the probability of a binary response, and it generates the response based on one or more predictors, independent variables, or features. By the way, the basic mathematical concepts behind this ML algorithm are used in deep learning (DL) as well.

First I want to explain why this algorithm is called logistic regression. The reason is that the algorithm uses the logistic function, or sigmoid function, and that is why it is called logistic regression. Logistic function and sigmoid function are synonyms of each other. We use the sigmoid function as the hypothesis function, and this function belongs to the hypothesis class. Now, what do we mean by the hypothesis function? Well, as we have seen earlier, the machine has to learn the mapping between the data attributes and the given labels in such a way that it can predict the label for new data. This can be achieved by the machine if it learns this mapping using a mathematical function.
So the mathematical function is called the hypothesis function, which the machine will use to classify the data and predict the labels or target concept. Here, as I said, we want to build a binary classifier, so our label is either spam or ham. Mathematically, I can assign 0 for ham or not-spam and 1 for spam, or vice versa, as per your choice. These mathematically assigned labels are our dependent variables. Now we need our output labels to be either zero or one. Mathematically, we can say that the label is y and y ∈ {0, 1}. So we need to choose the kind of hypothesis function that converts our output value to either zero or one, and the logistic function, or sigmoid function, does exactly that. This is the main reason why logistic regression uses the sigmoid function as the hypothesis function.

Logistic or sigmoid function

Let me provide you the mathematical equation for the logistic or sigmoid function. Refer to Figure 1:

Figure 1: Logistic or sigmoid function

You can see the plot showing g(z). Here, g(z) = Φ(z). Refer to Figure 2:

Figure 2: Graph of the sigmoid or logistic function

From the preceding graph you can see the following facts. If the value of z is greater than or equal to zero, then the logistic function gives an output value of 0.5 or more, which we treat as the class label one. If the value of z is less than zero, then the logistic function generates an output below 0.5, which we treat as the class label zero. You can see the following mathematical condition for the logistic function. Refer to Figure 3:

Figure 3: Mathematical property of the logistic function

Because of the preceding mathematical property, we can use this function to perform binary classification. Now it's time to show how this sigmoid function is represented as the hypothesis function. Refer to Figure 4:

Figure 4: Hypothesis function for logistic regression

If we take the preceding equation and substitute the value of z with θᵀx, then the equation given in Figure 1 is converted as follows. Refer to Figure 5:

Figure 5: Actual hypothesis function after mathematical manipulation

Here hθ(x) is the hypothesis function, θᵀ is the transpose of the matrix of parameters for the independent variables, and x stands for all the independent variables, that is, all possible features. In order to generate the hypothesis equation we replace the z value of the logistic function with θᵀx. By using the hypothesis equation, the machine actually tries to learn the mapping between the input variables or input features and the output labels.

Let's talk a bit about the interpretation of this hypothesis function. For logistic regression, can you think of the best way to predict the class label? We can predict the target class label by using the concept of probability. We need to generate the probability for both classes, and whichever class has the higher probability is assigned to that particular instance of features. In binary classification, the value of y, the target class, is either zero or one. If you are familiar with probability, you can represent the probability equation as given in Figure 6:

Figure 6: Interpretation of the hypothesis function using a probabilistic representation

For those who are not familiar with probability, P(y=1|x;θ) can be read like this: the probability of y = 1, given x, parameterized by θ. In simple language, you can say that this hypothesis function generates the probability value for target output 1, where we give the feature matrix x and some parameters θ. This is an intuitive concept, so for a while, you can keep all of this in mind.
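The equations referenced in Figures 1 through 6 are not reproduced in this excerpt; written out in standard notation (a reconstruction, not copied from the book's figures), the quantities being discussed are:

\[ g(z) = \frac{1}{1 + e^{-z}}, \qquad h_\theta(x) = g(\theta^T x) = \frac{1}{1 + e^{-\theta^T x}} \]

\[ P(y = 1 \mid x; \theta) = h_\theta(x), \qquad P(y = 0 \mid x; \theta) = 1 - h_\theta(x) \]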
Later on I will give you the reason why we need to generate probabilities, as well as how we can generate the probability values for each class. Here we complete the first step of the general approach to understanding logistic regression.

Cost or error function for logistic regression

First, let's understand what a cost function or error function is. Cost function, loss function, and error function are all names for the same thing. In ML it is a very important concept, so here we cover the definition of the cost function and the purpose of defining it. The cost function is the function we use to check how accurately our ML classifier performs. In our training dataset we have data and we have labels. When we use the hypothesis function and generate the output, we need to check how near we are to the actual label. If we predict the actual output label, then the difference between our hypothesis function output and the actual label is zero or minimal; if our hypothesis function output and the actual label are not the same, then we have a big difference between them. Suppose the actual label of an email is spam, which is 1, and our hypothesis function also generates the result 1: then the difference between the actual target value and the predicted output value is zero, and therefore the error in the prediction is also zero. If our predicted output is 1 and the actual output is zero, then we have the maximum error between our actual target concept and the prediction. So it is important for us to have minimum error in our prediction. This is the very basic concept of the error function. We will get into the mathematics in a moment. There are several types of error functions available, such as R² error, sum of squared error, and so on. The error function changes according to the ML algorithm and the hypothesis function.

Now, I know you want to know what the error function for logistic regression will be. I have put θ in our hypothesis function, so you also want to know what θ is and, if some value of θ needs to be chosen, how to approach that. Here I will give all the answers.

Let me give you some background on what we used to do in linear regression, as it will help you understand logistic regression. In linear regression we generally used the sum of squared errors (residual errors) as the cost function. So, just to give you some background on the sum of squared errors: in linear regression we are trying to generate the line of best fit for our dataset. As in the example I stated earlier, given height I want to predict the weight; in this case we first draw a line and measure the distance from each of the data points to the line. We square these distances, sum them, and try to minimize this error function. Refer to Figure 7:

Figure 7: Sum of squared error representation for reference

You can see the distance of each data point from the line, denoted using red lines; we take these distances, square them, and sum them. This is the error function we use in linear regression. From this error function we generate the partial derivatives with respect to the slope of the line, m, and with respect to the intercept, b. Each time, we calculate the error and update the values of m and b so we can generate the line of best fit. The process of updating m and b is called gradient descent. By using gradient descent we update m and b in such a way that our error function reaches its minimum error value, and we can generate the line of best fit.
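Written out in standard notation (again a reconstruction, not the book's Figure 7), the sum of squared errors for a candidate line y = mx + b and the gradient descent updates are:

\[ E(m, b) = \sum_{i=1}^{n} \bigl( y_i - (m x_i + b) \bigr)^2 \]

\[ m := m - \alpha \frac{\partial E}{\partial m}, \qquad b := b - \alpha \frac{\partial E}{\partial b} \]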
Gradient descent gives us the direction in which we need to adjust the line so we can generate the line of best fit. You can find a detailed example in Chapter 9, Deep Learning for NLU and NLG Problems. So, by defining the error function and generating the partial derivatives, we can apply the gradient descent algorithm, which helps us minimize our error or cost function.

Now back to the main question: which error function can we use for logistic regression? What do you think - can we use the sum of squared error function for logistic regression as well? If you know functions and calculus very well, then your answer is probably no, and that is the correct answer. Let me explain for those who aren't familiar with functions and calculus. In linear regression our hypothesis function is linear, so it is very easy for us to calculate the sum of squared errors, but here we are using the sigmoid function, which is non-linear. Applying the same function we used in linear regression will not turn out well, because if you take the sigmoid function, put it into the sum of squared error function, and try to visualize all possible values, you get a non-convex curve. Refer to Figure 8:

Figure 8: Non-convex and convex functions (Image credit: http://www.yuthon.com/images/non-convex_and_convex_function.png)

In machine learning we mostly use functions that produce a convex curve, because then we can use the gradient descent algorithm to minimize the error function and reach the global minimum with certainty. As you saw in Figure 8, a non-convex curve has many local minima, so reaching the global minimum is very challenging and very time consuming, because you would need to apply second-order or nth-order optimization techniques. With a convex curve, you can reach the global minimum with certainty, and quickly as well. So, if we plug our sigmoid function into the sum of squared errors we get a non-convex function, and therefore we are not going to define the same error function we used in linear regression.

Instead, we use the statistical concept called likelihood. To derive the likelihood function we use the probability equation given in Figure 6, considering all the data points in the training set. This gives us the following equation, which is the likelihood function. Refer to Figure 9:

Figure 9: Likelihood function for logistic regression (Image credit: http://cs229.stanford.edu/notes/cs229-notes1.pdf)

Now, in order to simplify the derivative process, we take the natural logarithm of the likelihood function (a monotonically increasing transformation); this is called the log likelihood. This log likelihood is our cost function for logistic regression. See the equation given in Figure 10:

Figure 10: Cost function for logistic regression

Here, to gain some intuition about the given cost function, we will plot it and understand what benefit it provides us. On the x-axis we have our hypothesis function. Our hypothesis function's range is 0 to 1, so we have these two points on the x-axis. Start with the first case, where y = 1 (the standard form of the likelihood and of this cost function is reconstructed just below for reference).
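In standard notation (a reconstruction of Figures 9 and 10, not copied from them), the likelihood and the resulting cost function are:

\[ L(\theta) = \prod_{i=1}^{m} h_\theta\bigl(x^{(i)}\bigr)^{y^{(i)}} \bigl(1 - h_\theta(x^{(i)})\bigr)^{1 - y^{(i)}} \]

\[ J(\theta) = -\frac{1}{m} \sum_{i=1}^{m} \Bigl[ y^{(i)} \log h_\theta\bigl(x^{(i)}\bigr) + \bigl(1 - y^{(i)}\bigr) \log\bigl(1 - h_\theta(x^{(i)})\bigr) \Bigr] \]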
You can see the generated curve, which is on the top right-hand side of Figure 11:

Figure 11: Logistic regression cost function graphs

If you take any log function plot and flip that curve (because here we have a negative sign), you get the same curve that we plotted in Figure 11. You can see the log graph as well as the flipped graph in Figure 12:

Figure 12: Comparing the log(x) and -log(x) graphs for a better understanding of the cost function (Image credit: http://www.sosmath.com/algebra/logs/log4/log42/log422/gl30.gif)

Here we are interested in the values 0 and 1, so we consider only that part of the graph, which is what we have depicted in Figure 11. This cost function has some interesting and useful properties. If the predicted or candidate label is the same as the actual target label, then the cost is zero. You can put it like this: if y = 1 and the hypothesis function predicts hθ(x) = 1, then the cost is 0, but if hθ(x) tends towards 0, then the cost function blows up to ∞. Now, for y = 0, you can see the graph on the top left-hand side of Figure 11. This case has the same kind of advantages and properties that we saw earlier: the cost goes to ∞ when the actual value is 0 and the hypothesis function predicts 1, and if the hypothesis function predicts 0 and the actual target is also 0, then the cost is 0.

As I told you earlier, I would give you the reason why we choose this cost function: the reason is that this function makes our optimization easy, as we are maximizing the log likelihood, and the function has a convex curve, which helps us run gradient descent. In order to apply gradient descent we need to generate the partial derivative with respect to θ, which gives us the equation in Figure 13:

Figure 13: Partial derivative for performing gradient descent (Image credit: http://2.bp.blogspot.com)

This equation is used for updating the parameter value of θ, and α here defines the learning rate. This is the parameter you can use to control how fast or how slowly your algorithm should learn or train. If you set the learning rate too high, the algorithm cannot learn, and if you set it too low, it takes a lot of time to train. So you need to choose the learning rate wisely. Now let's start building the spam filtering application.

Data loading and exploration

To build the spam filtering application we need a dataset. Here we are using a small dataset that is simple and straightforward. This dataset has two attributes: the first attribute is the label and the second attribute is the text content of the email. Let's discuss the first attribute a little more. The presence of the label makes this a tagged dataset. The label indicates whether the email content belongs to the spam category or the ham category. Let's jump into the practical part. Here we are using numpy, pandas, and scikit-learn as dependency libraries. So let's explore our dataset first. We read the dataset using the pandas library. I have also checked how many data records we have in total, along with basic details of the dataset. Once we load the data, we will check its first ten records and then replace the spam and ham categories with numbers.
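As a rough illustration of the loading-and-exploration step just described, the following pandas sketch assumes a hypothetical spam_dataset.csv file with label and text columns; the actual file name and column names used in the book's code (shown in the figures) may differ:

import pandas as pd

# Hypothetical file and column names; the dataset used in the book may differ.
df = pd.read_csv("spam_dataset.csv", names=["label", "text"])

print(df.shape)       # total number of records and columns
print(df.head(10))    # first ten records

# Replace the string labels with numbers: ham -> 0, spam -> 1
df["label"] = df["label"].map({"ham": 0, "spam": 1})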
As we have seen, the machine can understand only numerical format, so here every ham label is converted into 0 and every spam label is converted into 1. Refer to Figure 14:

Figure 14: Code snippet for converting labels into numerical format

Split the dataset into a training dataset and a testing dataset

In this part we divide our dataset into two parts: one part is called the training set and the other part is called the testing set. Refer to Figure 15:

Figure 15: Code snippet for dividing the dataset into a training dataset and a testing dataset

We divide the dataset into two parts because we perform training using the training dataset; once our ML algorithm has been trained on that dataset and has generated an ML model, we feed the testing data into that model, and as a result the model generates predictions. Based on those results we evaluate our ML model.
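Putting the splitting, training, and evaluation steps together, here is a hedged scikit-learn sketch of the pipeline just described. The bag-of-words vectorizer, the 80/20 split, and the column names are illustrative assumptions and not necessarily what the code in Figures 14 and 15 uses:

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# df is the pandas DataFrame loaded earlier, with numeric labels (ham=0, spam=1)
X_train, X_test, y_train, y_test = train_test_split(
    df["text"], df["label"], test_size=0.2, random_state=42)

# Turn the raw email text into bag-of-words count features
vectorizer = CountVectorizer()
X_train_counts = vectorizer.fit_transform(X_train)
X_test_counts = vectorizer.transform(X_test)

# Train the logistic regression classifier and evaluate it on the held-out test set
classifier = LogisticRegression()
classifier.fit(X_train_counts, y_train)
predictions = classifier.predict(X_test_counts)
print("Test accuracy:", accuracy_score(y_test, predictions))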
Data Exploration using Spark SQL

Packt
08 Mar 2018
9 min read
In this article by Aurobindo Sarkar, the author of the book Learning Spark SQL, we will cover the following points to introduce you to using Spark SQL for exploratory data analysis:

What is Exploratory Data Analysis (EDA)?
Why is EDA important?
Using Spark SQL for basic data analysis
Visualizing data with Apache Zeppelin

Introducing Exploratory Data Analysis (EDA)

Exploratory Data Analysis (EDA), or Initial Data Analysis (IDA), is an approach to data analysis that attempts to maximize insight into data. This includes assessing the quality and structure of the data, calculating summary or descriptive statistics, and plotting appropriate graphs. It can uncover underlying structures and suggest how the data should be modeled. Furthermore, EDA helps us detect outliers, errors, and anomalies in our data, and deciding what to do about such data is often more important than other, more sophisticated analysis. EDA enables us to test our underlying assumptions, discover clusters and other patterns in our data, and identify possible relationships between various variables. A careful EDA process is vital to understanding the data and is sometimes sufficient to reveal such poor data quality that a more sophisticated model-based analysis is not justified.

Typically, the graphical techniques used in EDA are simple, consisting of plotting the raw data and simple statistics. The focus is on the structures and models revealed by the data, or that best fit the data. EDA techniques include scatter plots, box plots, histograms, probability plots, and so on. In most EDA techniques, we use all of the data, without making any underlying assumptions. The analyst builds intuition, or gets a “feel”, for the dataset as a result of such exploration. More specifically, the graphical techniques allow us to efficiently select and validate appropriate models, test our assumptions, identify relationships, select estimators, and detect outliers. EDA involves a lot of trial and error, and several iterations. The best way is to start simple and then build in complexity as you go along. There is a major trade-off in modeling between simple models and more accurate ones. Simple models may be much easier to interpret and understand. These models can get you to 90% accuracy very quickly, versus a more complex model that might take weeks or months to get you an additional 2% improvement. For example, you should plot simple histograms and scatter plots to quickly start developing an intuition for your data.

Using Spark SQL for basic data analysis

Interactively processing and visualizing large data is challenging, as the queries can take a long time to execute and the visual interface cannot accommodate as many pixels as data points. Spark supports in-memory computations and a high degree of parallelism to achieve interactivity with large distributed data. In addition, Spark is capable of handling petabytes of data and provides a set of versatile programming interfaces and libraries. These include SQL, Scala, Python, Java, and R APIs, and libraries for distributed statistics and machine learning. For data that fits into a single computer, there are many good tools available, such as R, MATLAB, and others. However, if the data does not fit into a single machine, or if it is very complicated to get the data to that machine, or if a single computer cannot easily process the data, then this section will offer some good tools and techniques for data exploration. Here, we will do some basic data exploration exercises to understand a sample dataset.
We will use a dataset that contains data related to direct marketing campaigns (phone calls) of a Portuguese banking institution. The marketing campaigns were based on phone calls to customers. We use the bank-additional-full.csv file that contains 41188 records and 20 input fields, ordered by date (from May 2008 to November 2010). As a first step, let's define a schema and read in the CSV file to create a DataFrame. You can use :paste to paste in the initial set of statements in the Spark shell, as shown in the following figure:

After the DataFrame is created, we first verify the number of records. We can also define a case class called Call for our input records, and then create a strongly-typed Dataset, as follows:

In the next section, we will begin our data exploration by identifying missing data in our dataset.

Identifying Missing Data

Missing data can occur in datasets due to reasons ranging from negligence to a refusal on the part of respondents to provide a specific data point. However, in all cases missing data is a common occurrence in real-world datasets. Missing data can create problems in data analysis and sometimes lead to wrong decisions or conclusions. Hence, it is very important to identify missing data and devise effective strategies for dealing with it. Here, we analyze the numbers of records with missing data fields in our sample dataset. In order to simulate missing data, we will edit our sample dataset by replacing fields containing “unknown” values with empty strings. First, we create a DataFrame / Dataset from our edited file, as shown in the following figure:

The following two statements give us a count of rows with certain fields having missing data. Later, we will look at effective ways of dealing with missing data and compute some basic statistics for the sample dataset to improve our understanding of the data.

Computing basic statistics

Computing basic statistics is essential for a good preliminary understanding of our data. First, for convenience, we create a case class and a Dataset containing a subset of fields from our original DataFrame. In the following example, we choose some of the numeric fields and the outcome field, that is, the “term deposit subscribed” field. Next, we use describe to quickly compute the count, mean, standard deviation, min, and max values for the numeric columns in our dataset. Further, we use the stat package to compute additional statistics such as covariance and correlation, creating crosstabs, examining items that occur most frequently in data columns, and computing quantiles. These computations are shown in the following figure:

Next, we use the typed aggregation functions to summarize our data to understand it better. In the following statement, we aggregate the results by whether a term deposit was subscribed, along with the total customers contacted, the average number of calls made per customer, the average duration of the calls, and the average number of previous calls made to such customers. The results are rounded to two decimal points. Similarly, executing the following statement gives similar results by customers' age.

After getting a better understanding of our data by computing basic statistics, we shift our focus to identifying outliers in our data.

Identifying data outliers

An outlier or an anomaly is an observation of the data that deviates significantly from other observations in the dataset. Erroneous outliers are observations that are distorted due to possible errors in the data-collection process.
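The chapter's code appears as Scala figures that are not reproduced here. As a rough, hedged equivalent in PySpark (keeping to Python, as elsewhere in this collection), the read-and-summarize steps might look like the following; the semicolon delimiter and the column names are assumptions based on the publicly available bank-additional-full.csv layout:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("bank-eda").getOrCreate()

# The file is assumed to be semicolon-delimited, as in the public UCI bank marketing dataset.
df = (spark.read
      .option("header", "true")
      .option("inferSchema", "true")
      .option("sep", ";")
      .csv("bank-additional-full.csv"))

print(df.count())                                  # verify the number of records

# Quick summary statistics for a few numeric columns
df.describe("age", "duration", "campaign", "previous").show()

# Additional statistics from the stat package
print(df.stat.corr("age", "duration"))             # correlation between two columns
df.stat.crosstab("marital", "y").show()            # crosstab of marital status vs outcome

# Aggregate by whether a term deposit was subscribed (the "y" column)
(df.groupBy("y")
   .agg(F.count("*").alias("customers"),
        F.round(F.avg("campaign"), 2).alias("avg_calls"),
        F.round(F.avg("duration"), 2).alias("avg_duration"),
        F.round(F.avg("previous"), 2).alias("avg_prev_calls"))
   .show())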
These outliers may exert undue influence on the results of statistical analysis, so they should be identified using reliable detection methods prior to performing data analysis. Many algorithms find outliers as a side-product of clustering; however, these techniques define outliers as points that do not lie in clusters. The user has to model the data points using statistical distributions, and the outliers are identified depending on how they appear in relation to the underlying model. The main problem with these approaches is that during EDA the user typically does not have enough knowledge about the underlying data distribution. EDA, using a modeling and visualizing approach, is a good way of achieving a deeper intuition of our data. Spark MLlib supports a large (and growing) set of distributed machine learning algorithms to make this task simpler. In the following example, we use the k-means clustering algorithm to compute two clusters in our data. Other distributed algorithms useful for EDA include classification, regression, dimensionality reduction, correlation, and hypothesis testing.

Visualizing data with Apache Zeppelin

Typically, we will generate many graphs to verify our hunches about the data. A lot of these quick and dirty graphs used during EDA are, ultimately, discarded. Exploratory data visualization is critical for data analysis and modeling. However, we often skip exploratory visualization with large data because it is hard. For instance, browsers typically cannot handle millions of data points. Hence we have to summarize, sample, or model our data before we can effectively visualize it. Traditionally, BI tools provided extensive aggregation and pivoting features to visualize the data. However, these tools typically used nightly jobs to summarize large volumes of data. The summarized data was subsequently downloaded and visualized on the practitioner's workstation. Spark can eliminate many of these batch jobs to support interactive data visualization. Here, we will explore some basic data visualization techniques using Apache Zeppelin. Apache Zeppelin is a web-based tool that supports interactive data analysis and visualization. It supports several language interpreters and comes with built-in Spark integration. Hence, it is quick and easy to get started with exploratory data analysis using Apache Zeppelin. You can download Apache Zeppelin from https://zeppelin.apache.org/. Unzip the package on your hard drive and start Zeppelin using the following command:

Aurobindos-MacBook-Pro-2:zeppelin-0.6.2-bin-all aurobindosarkar$ bin/zeppelin-daemon.sh start

You should see the following message:

Zeppelin start    [ OK ]

You should be able to see the Zeppelin home page at http://localhost:8080/. Click on the Create new note link, and specify a path and name for your notebook, as shown in the following figure:

In the next step, we paste the same code as at the beginning of this article to create a DataFrame for our sample dataset. We can execute typical DataFrame operations, as shown in the following figure:

Next, we create a table from our DataFrame and execute some SQL on it. The results of the SQL statements' execution can be charted by clicking on the appropriate chart type required. Here, we create bar charts as an illustrative example of summarizing and visualizing data:

We can also plot a scatter plot, and read the coordinate values of each of the points plotted, as shown in the following two figures.
Additionally, we can create a textbox that accepts input values to make the experience interactive. In the following figure we create a textbox that can accept different values for the age parameter, and the bar chart is updated accordingly. Similarly, we can also create drop-down lists where the user can select the appropriate option, and the table of values or the chart automatically gets updated.

Summary

In this article, we demonstrated using Spark SQL for exploring datasets, performing basic data quality checks, generating samples and pivot tables, and visualizing data with Apache Zeppelin.
How to set up a Deep Learning System on Amazon Web Services (AWS)

Gebin George
07 Mar 2018
5 min read
[box type="note" align="" class="" width=""]This article is an excerpt from the book, Deep Learning Essentials written by Wei Di, Anurag Bhardwaj, and Jianing Wei. This book covers popular Python libraries such as Tensorflow, Keras, and more, along with tips to train, deploy and optimize deep learning models in the best possible manner.[/box] Today, we will learn two different methods of setting up a deep learning system using Amazon Web Services (AWS).

Setup from scratch

We will illustrate how to set up a deep learning environment on an AWS EC2 GPU instance g2.2xlarge running Ubuntu Server 16.04 LTS. For this example, we will use a pre-baked Amazon Machine Image (AMI), which already has a number of software packages installed, making it easier to set up an end-to-end deep learning system. We will use a publicly available AMI image, ami-b03ffedf, which has the following packages pre-installed:

CUDA 8.0
Anaconda 4.2.0 with Python 3
Keras / Theano

The first step to setting up the system is to set up an AWS account and spin up a new EC2 GPU instance using the AWS web console (http://console.aws.amazon.com/), as shown in the figure Choose EC2 AMI:
2. We pick a g2.2xlarge instance type from the next page, as shown in the figure Choose instance type:
3. After adding 30 GB of storage, as shown in the figure Choose storage, we now launch a cluster and assign an EC2 key pair that allows us to ssh and log in to the box using the provided key pair file:
4. Once the EC2 box is launched, the next step is to install the relevant software packages. To ensure proper GPU utilization, it is important to ensure graphics drivers are installed first. We will upgrade and install NVIDIA drivers as follows:

$ sudo add-apt-repository ppa:graphics-drivers/ppa -y
$ sudo apt-get update
$ sudo apt-get install -y nvidia-375 nvidia-settings

While the NVIDIA drivers ensure that the host GPU can now be utilized by any deep learning application, they do not provide an easy interface to application developers for easy programming on the device. Various software libraries exist today that help achieve this task reliably. Open Computing Language (OpenCL) and CUDA are more commonly used in industry. In this book, we use CUDA as an application programming interface for accessing NVIDIA graphics drivers. To install the CUDA driver, we first SSH into the EC2 instance, download CUDA 8.0 to our $HOME folder, and install it from there:

$ wget https://developer.nvidia.com/compute/cuda/8.0/Prod2/local_installers/cuda-repo-ubuntu1604-8-0-local-ga2_8.0.61-1_amd64-deb
$ sudo dpkg -i cuda-repo-ubuntu1604-8-0-local-ga2_8.0.61-1_amd64-deb
$ sudo apt-get update
$ sudo apt-get install -y cuda nvidia-cuda-toolkit

Once the installation is finished, you can run the following command to validate the installation:

$ nvidia-smi

Now your EC2 box is fully configured to be used for deep learning development. However, for someone who is not very familiar with deep learning implementation details, building a deep learning system from scratch can be a daunting task. To ease this development, a number of advanced deep learning software frameworks exist, such as Keras and Theano.
Both of these frameworks are based on a Python development environment, hence we first install a Python distribution on the box, such as Anaconda:

$ wget https://repo.continuum.io/archive/Anaconda3-4.2.0-Linux-x86_64.sh
$ bash Anaconda3-4.2.0-Linux-x86_64.sh

Finally, Keras and Theano are installed using Python's package manager, pip:

$ pip install --upgrade --no-deps git+git://github.com/Theano/Theano.git
$ pip install keras

Once the pip installation is completed successfully, the box is now fully set up for deep learning development.

Setup using Docker

The previous section describes getting started from scratch, which can be tricky sometimes given continuous changes to software packages and changing links on the web. One way to avoid dependence on links is to use container technology like Docker. In this chapter, we will use the official NVIDIA-Docker image that comes pre-packaged with all the necessary packages and deep learning frameworks to get you quickly started with deep learning application development:

$ sudo add-apt-repository ppa:graphics-drivers/ppa -y
$ sudo apt-get update
$ sudo apt-get install -y nvidia-375 nvidia-settings nvidia-modprobe

We now install Docker Community Edition as follows:

$ curl -fsSL https://download.docker.com/linux/ubuntu/gpg | sudo apt-key add -
# Verify that the key fingerprint is 9DC8 5822 9FC7 DD38 854A E2D8 8D81 803C 0EBF CD88
$ sudo apt-key fingerprint 0EBFCD88
$ sudo add-apt-repository "deb [arch=amd64] https://download.docker.com/linux/ubuntu $(lsb_release -cs) stable"
$ sudo apt-get update
$ sudo apt-get install -y docker-ce

2. We then install NVIDIA-Docker and its plugin:

$ wget -P /tmp https://github.com/NVIDIA/nvidia-docker/releases/download/v1.0.1/nvidia-docker_1.0.1-1_amd64.deb
$ sudo dpkg -i /tmp/nvidia-docker_1.0.1-1_amd64.deb && rm /tmp/nvidia-docker_1.0.1-1_amd64.deb

3. To validate whether the installation happened correctly, we use the following command:

$ sudo nvidia-docker run --rm nvidia/cuda nvidia-smi

4. Once it's set up correctly, we can use the official TensorFlow or Theano Docker image:

$ sudo nvidia-docker run -it tensorflow/tensorflow:latest-gpu bash

5. We can run a simple Python program to check if TensorFlow works properly:

import tensorflow as tf
a = tf.constant(5, tf.float32)
b = tf.constant(5, tf.float32)
with tf.Session() as sess:
    output = sess.run(tf.add(a, b))  # output is 10.0
    print("Output of graph computation is = ", output)

You should see the TensorFlow output on the screen now, as shown in the figure Tensorflow sample output: We saw how to set up a deep learning system on AWS from scratch and on Docker. If you found our post useful, do check out this book Deep Learning Essentials to optimize deep learning models for better performance output.
ASP.NET Core High Performance

Packt
07 Mar 2018
20 min read
In this article by James Singleton, the author of the book ASP.NET Core High Performance, we will see that many things have changed for version 2 of the ASP.NET Core framework and there have also been a lot of improvements to the various supporting technologies. Now is a great time to give it a try, as the code has stabilized and the pace of change has settled down a bit. There were significant differences between the original release candidate and version 1 of ASP.NET Core and yet more alterations between version 1 and version 2. Some of these changes have been controversial, particularly around tooling but the scope of .NET Core has grown massively and ultimately this is a good thing. One of the highest profile differences between 1 and 2 is the change (and some would say regression) away from the new JavaScript Object Notation (JSON) based project format and back towards the Extensible Markup Language (XML) based .csproj format. However, it is a simplified and stripped down version compared to the format used in the original .NET Framework. There has been a move towards standardization between the different .NET frameworks and .NET Core 2 has a much larger API surface as a result. The interface specification known as .NET Standard 2 covers the intersection between .NET Core, the .NET Framework, and Xamarin. There is also an effort to standardize Extensible Application Markup Language (XAML) into the XAML Standard that will work across Universal Windows Platform (UWP) and Xamarin.Forms apps. C# and .NET can be used on a huge amount of platforms and in a large number of use cases, from server side web applications to mobile apps and even games using engines like Unity 3D. In this article we will go over the changes between version 1 and version 2 of the new Core releases. We will also look at some new features of the C# language. There are many useful additions and a plethora of performance improvement too. In this article we will cover: .NET Core 2 scope increases ASP.NET Core 2 additions Performance improvements .NET Standard 2 New C# 6 features New C# 7 features JavaScript considerations New in Core 2 There are two different products in the Core family. The first is .NET Core, which is the low level framework providing basic libraries. This can be used to write console applications and it is also the foundation for higher level application frameworks. The second is ASP.NET Core, which is a framework for building web applications that run on a server and service clients (usually web browsers). This was originally the only workload for .NET Core until it grew in scope to handle a more diverse range of scenarios. We'll cover the differences in the new versions separately for each of these frameworks. The changes in .NET Core will also apply to ASP.NET Core, unless you are running it on top of the .NET Framework version 4. New in .NET Core 2 The main focus of .NET Core 2 is the huge increase in scope. There are more than double the number of APIs included and it supports .NET Standard 2 (covered later in this article). You can also refer .NET Framework assemblies with no recompile required. This should just work as long as the assemblies only use APIs that have been implemented in .NET Core. This means that more NuGet packages will work with .NET Core. Finding if your favorite library was supported or not, was always a challenge with the previous version. The author set up a repository listing package compatibility to help with this. 
You can find the ASP.NET Core Library and Framework Support (ANCLAFS) list at github.com/jpsingleton/ANCLAFS and also via anclafs.com. If you want to make a change then please send a pull request. Hopefully in the future all packages will support Core and this list will no longer be required. There is now support in Core for Visual Basic and for more Linux distributions. You can also perform live unit testing with Visual Studio 2017, much like the old NCrunch extension. Performance improvements Some of the more interesting changes for 2 are the performance improvements over the original .NET Framework. There have been tweaks to the implementations of many of the framework data structures. Some of the classes and methods that have seen speed improvements or memory reduction include: List<T> Queue<T> SortedSet<T> ConcurrentQueue<T> Lazy<T> Enumerable.Concat() Enumerable.OrderBy() Enumerable.ToList() Enumerable.ToArray() DeflateStream SHA256 BigInteger BinaryFormatter Regex WebUtility.UrlDecode() Encoding.UTF8.GetBytes() Enum.Parse() DateTime.ToString() String.IndexOf() String.StartsWith() FileStream Socket NetworkStream SslStream ThreadPool SpinLock We won't go into specific benchmarks here because benchmarking is hard and the improvements you see will clearly depend on your usage. The thing to take away is that lots of work has been done to increase the performance of .NET Core, both over the previous version 1 and .NET Framework 4.7. Many of these changes have come from the community, which shows one of the benefits of open source development. Some of these advances will probably work their way back into a future version of the regular .NET Framework too. There have also been improvements to the RyuJIT compiler for .NET Core 2. As just one example, finally blocks are now almost as efficient as not using exception handing at all, in the normal situation where no exceptions are thrown. You now have no excuses not to liberally use try and using blocks, for example by having checked arithmetic to avoid integer overflows. New in ASP.NET Core 2 ASP.NET Core 2 takes advantage of all the improvements to .NET Core 2, if that is what you choose to run it on. It will also run on .NET Framework 4.7 but it's best to run it on .NET Core, if you can. With the increase in scope and support of .NET Core 2 this should be less of a problem than it was previously. It includes a new meta package so you only need to reference one NuGet item to get all the things! However, it is still composed of individual packages if you want to pick and choose. They haven't reverted back to the bad old days of one huge System.Web assembly. A new package trimming feature will ensure that if you don't use a package then its binaries won't be included in your deployment, even if you use the meta package to reference it. There is also a sensible default for setting up a web host configuration. You don't need to add logging, Kestrel, and IIS individually anymore. Logging has also got simpler and, as it is built in, you have no excuses not to use it from the start. A new feature is support for controller-less Razor Pages. These are exactly what they sound like and allow you to write pages with just a Razor template. This is similar to the Web Pages product, not to be confused with Web Forms. There is talk of Web Forms making a comeback, but if so then hopefully the abstraction will be thought out more and it won't carry so much state around with it. There is a new authentication model that makes better use of Dependency Injection. 
ASP.NET Core Identity allows you to use OpenID, OAuth 2 and get access tokens for your APIs. A nice time saver is you no longer need to emit anti-forgery tokens in forms (to prevent Cross Site Request Forgery) with attributes to validate them on post methods. This is all done automatically for you, which should prevent you forgetting to do this and leaving a security vulnerability. Performance improvements There have been additional increases to performance in ASP.NET Core that are not related to the improvements in .NET Core, which also help. Startup time has been reduced by shipping binaries that have already been through the Just In Time compilation process. Although not a new feature in ASP.NET Core 2, output caching is now available. In 1.0, only response caching was included, which simply set the correct HTTP headers. In 1.1, an in-memory cache was added and today you can use local memory or a distributed cache kept in SQL Server or Redis. Standards Standards are important, that's why we have so many of them. The latest version of the .NET Standard is 2 and .NET Core 2 implements this. A good way to think about .NET Standard is as an interface that a class would implement. The interface defines an abstract API but the concrete implementation of that API is left up to the classes that inherit from it. Another way to think about it is like the HTML5 standard that is supported by different web browsers. Version 2 of the .NET Standard was defined by looking at the intersection of the .NET Framework and Mono. This standard was then implemented by .NET Core 2, which is why is contains so many more APIs than version 1. Version 4.6.1 of the .NET Framework also implements .NET Standard 2 and there is work to support the latest versions of the .NET Framework, UWP and Xamarin (including Xamarin.Forms). There is also the new XAML Standard that aims to find the common ground between Xamarin.Forms and UWP. Hopefully it will include Windows Presentation Foundation (WPF) in the future. If you create libraries and packages that use these standards then they will work on all the platforms that support them. As a developer simply consuming libraries, you don't need to worry about these standards. It just means that you are more likely to be able to use the packages that you want, on the platforms you are working with. New C# features It not just the frameworks and libraries that have been worked on. The underlying language has also had some nice new features added. We will focus on C# here as it is the most popular language for the Common Language Runtime (CLR). Other options include Visual Basic and the functional programming language F#. C# is a great language to work with, especially when compared to a language like JavaScript. Although JavaScript is great for many reasons (such as its ubiquity and the number of frameworks available), the elegance and design of the language is not one of them. Many of these new features are just syntactic sugar, which means they don't add any new functionality. They simply provide a more succinct and easier to read way of writing code that does the same thing. C# 6 Although the latest version of C# is 7, there are some very handy features in C# 6 that often go underused. Also, some of the new additions in 7 are improvements on features added in 6 and would not make much sense without context. We will quickly cover a few features of C# 6 here, in case you are unaware of how useful they can be. 
String interpolation String interpolation is a more elegant and easier to work with version of the familiar string format method. Instead of supplying the arguments to embed in the string placeholders separately, you can now embed them directly in the string. This is far more readable and less error prone. Let us demonstrate with an example. Consider the following code that embeds an exception in a string. catch (Exception e) { Console.WriteLine("Oh dear, oh dear! {0}", e); } This embeds the first (and in this case only) object in the string at the position marked by the zero. It may seem simple but this quickly gets complex if you have many objects and want to add another at the start. You then have to correctly renumber all the placeholders. Instead you can now prefix the string with a dollar character and embed the object directly in it. This is shown in the following code that behaves the same as the previous example. catch (Exception e) { Console.WriteLine($"Oh dear, oh dear! {e}"); } The ToString() method on an exception outputs all the required information including name, message, stack trace and any inner exceptions. There is no need to deconstruct it manually, you may even miss things if you do. You can also use the same format strings as you are used to. Consider the following code that formats a date in a custom manner. Console.WriteLine($"Starting at: {DateTimeOffset.UtcNow:yyyy/MM/ddHH:mm:ss}"); When this feature was being built, the syntax was slightly different. So be wary of any old blog posts or documentation that may not be correct. Null conditional The null conditional operator is a way of simplifying null checks. You can now inline a check for null rather than using an if statement or ternary operator. This makes it easier to use in more places and will hopefully help you to avoid the dreaded null reference exception. You can avoid doing a manual null check like in the following code. int? length = (null == bytes) ? null : (int?)bytes.Length; This can now be simplified to the following statement by adding a question mark. int? length = bytes?.Length; Exception filters You can filter exceptions more easily with the when keyword. You no longer need to catch every type of exception that you are interested in and then filter manually inside the catch block. This is a feature that was already present in VB and F# so it's nice that C# has finally caught up. There are some small benefits to this approach. For example, if your filter is not matched then the exception can still be caught by other catch blocks in the same try statement. You also don't need to remember to re-throw the exception to avoid it being swallowed. This helps with debugging, as Visual Studio will no longer break, as it would when you throw. For example, you could check to see if there is a message in the exception and handle it differently, as shown here. catch (Exception e) when (e?.Message?.Length> 0) When this feature was in development, a different keyword (if) was used. So be careful of any old information online. One thing to keep in mind is that relying on a particular exception message is fragile. If your application is localized then the message may be in a different language than what you expect. This holds true outside of exception filtering too. Asynchronous availability Another small improvement is that you can use the await keyword inside catch and finally blocks. This was not initially allowed when this incredibly useful feature was added in C# 5. There is not a lot more to say about this. 
The implementation is complex but you don't need to worry about this unless you're interested in the internals. From a developer point of view, it just works, as in this trivial example. catch (Exception e) when (e?.Message?.Length> 0) { await Task.Delay(200); } This feature has been improved in C# 7, so read on. You will see async and await used a lot. Asynchronous programming is a great way of improving performance and not just from within your C# code. Expression bodies Expression bodies allow you to assign an expression to a method or getter property using the lambda arrow operator (=>) that you may be familiar with from fluent LINQ syntax. You no longer need to provide a full statement or method signature and body. This feature has also been improved in C# 7 so see the examples in the next section. For example, a getter property can be implemented like so. public static string Text => $"Today: {DateTime.Now:o}"; A method can be written in a similar way, such as the following example. private byte[] GetBytes(string text) => Encoding.UTF8.GetBytes(text); C# 7 The most recent version of the C# language is 7 and there are yet more improvements to readability and ease of use. We'll cover a subset of the more interesting changes here. Literals There are couple of minor additional capabilities and readability enhancements when specifying literal values in code. You can specify binary literals, which means you don't have to work out how to represent them using a different base anymore. You can also put underscores anywhere within a literal to make it easier to read the number. The underscores are ignored but allow you to separate digits into convention groupings. This is particularly well suited to the new binary literal as it can be very verbose listing out all those zeros and ones. Take the following example using the new 0b prefix to specify a binary literal that will be rendered as an integer in a string. Console.WriteLine($"Binary solo! {0b0000001_00000011_000000111_00001111}"); You can do this with other bases too, such as this integer, which is formatted to use a thousands separator. Console.WriteLine($"Over {9_000:#,0}!"); // Prints "Over 9,000!" Tuples One of the big new features in C# 7 is support for tuples. Tuples are groups of values and you can now return them directly from method calls. You are no longer restricted to returning a single value. Previously you could work around this limitation in a few sub-optimal ways, including creating a custom complex object to return, perhaps with a Plain Old C# Object (POCO) or Data Transfer Object (DTO), which are the same thing. You could have also passed in a reference using the ref or out keywords, which although there are improvements to the syntax are still not great. There was System.Tuple in C# 6 but this wasn't ideal. It was a framework feature, rather than a language feature and the items were only numbered and not named. With C# 7 tuples, you can name the objects and they make a great alternative to anonymous types, particularly in LINQ query expression lambda functions. As an example, if you only want to work on a subset of the data available, perhaps when filtering a database table with an O/RM such as Entity Framework, then you could use a tuple for this. The following example returns a tuple from a method. You may need to add the System.ValueTupleNuGet package for this to work. 
private static (int one, string two, DateTime three) GetTuple() { return (one: 1, two: "too", three: DateTime.UtcNow); } You can also use tuples in string interpolation and all the values are rendered, as shown here. Console.WriteLine($"Tuple = {GetTuple()}"); Out variables If you did want to pass parameters into a method for modification then you have always needed to declare them first. This is no longer necessary and you can simply declare the variables at the point you pass them in. You can also declare a variable to be discarded by using an underscore. This is particularly useful if you don't want to use the returned value, for example in some of the try parse methods of the native framework data types. Here we parse a date without declaring the dt variable first. DateTime.TryParse("2017-08-09", out var dt); In this example we test for an integer but we don't care what it is. var isInt = int.TryParse("w00t", out _); References You can now return values by reference from a method as well as consume them. This is a little like working with pointers in C but safer. For example, you can only return references that were passed into the method and you can't modify references to point to a different location in memory. This is a very specialist feature but in certain niche situations it can dramatically improve performance. Given the following method. private static ref string GetFirstRef(ref string[] texts) { if (texts?.Length> 0) { return ref texts[0]; } throw new ArgumentOutOfRangeException(); } You could call it like so, and the second console output line would appear differently (one instead of 1). var strings = new string[] { "1", "2" }; ref var first = ref GetFirstRef(ref strings); Console.WriteLine($"{strings?[0]}"); // 1 first = "one"; Console.WriteLine($"{strings?[0]}"); // one Patterns The other big addition is you can now match patterns in C# 7 using the is keyword. This simplifies testing for null and matching against types, among other things. It also lets you easily use the cast value. This is a simpler alternative to using full polymorphism (where a derived class can be treated as a base class and override methods). However, if you control the code base and can make use of proper polymorphism, then you should still do this and follow good Object-Oriented Programming (OOP) principles. In the following example, pattern matching is used to parse the type and value of an unknown object. private static int PatternMatch(object obj) { if (obj is null) { return 0; } if (obj is int i) { return i++; } if (obj is DateTime d || (obj is string str && DateTime.TryParse(str, out d))) { return d.DayOfYear; } return -1; } You can also use pattern matching in the cases of a switch statement and you can switch on non-primitive types such as custom objects. More expression bodies Expression bodies are expanded from the offering in C# 6 and you can now use them in more places, for example as object constructors and property setters. Here we extend our previous example to include setting the value on the property we were previously just reading. private static string text; public static string Text { get => text ?? $"Today: {DateTime.Now:r}"; set => text = value; } More asynchronous improvements There have been some small improvements to what async methods can return and, although small, they could offer big performance gains in certain situations. You no longer have to return a task, which can be beneficial if the value is already available. 
This can reduce the overheads of using async methods and creating a task object. JavaScript You can't write a book on web applications without covering JavaScript. It is everywhere. If you write a web app that does a full page load on every request and it's not a simple content site then it will feel slow. Users expect responsiveness. If you are a back-end developer then you may think that you don't have to worry about this. However, if you are building an API then you may want to make it easy to consume with JavaScript and you will need to make sure that your JSON is correctly and quickly serialized. Even if you are building a Single Page Application (SPA) in JavaScript (or TypeScript) that runs in the browser, the server can still play a key role. You can use SPA services to run Angular or React on the server and generate the initial output. This can increase performance, as the browser has something to render immediately. For example, there is a project called React.NET that integrates React with ASP.NET, and it supports ASP.NET Core. If you have been struggling to keep up with the latest developments in the .NET world then JavaScript is on another level. There seems to be something new almost every week and this can lead to framework fatigue and the paradox of choice. There is so much to choose from that you don't know what to pick. Summary In this article, you have seen a brief high-level summary of what has changed in .NET Core 2 and ASP.NET Core 2, compared to the previous versions. You are also now aware of .NET Standard 2 and what it is for. We have shown examples of some of the new features available in C# 6 and C# 7. These can be very useful in letting you write more with less, and in making your code more readable and easier to maintain.
Working with Forensic Evidence Container Recipes

Packt
07 Mar 2018
13 min read
In this article by Preston Miller and Chapin Bryce, authors of Learning Python for Forensics, we introduce a recipe from our upcoming book, Python Digital Forensics Cookbook. In Python Digital Forensics Cookbook, each chapter comprises many scripts, or recipes, falling under specific themes. The "Iterating Through Files" recipe shown here is from our chapter that introduces the Sleuth Kit's Python bindings, pytsk3, and other libraries, to programmatically interact with forensic evidence containers. Specifically, this recipe shows how to access a forensic evidence container and iterate through all of its files to create an active file listing of its contents. (For more resources related to this topic, see here.) If you are reading this article, it goes without saying that Python is a key tool in DFIR investigations. However, most examiners are not familiar with, or do not take advantage of, the Sleuth Kit's Python bindings. Imagine being able to run your existing scripts against forensic containers without needing to mount them or export loose files. This recipe continues to introduce the library, pytsk3, that will allow us to do just that and take our scripting capabilities to the next level. In this recipe, we learn how to recurse through the filesystem and create an active file listing. Oftentimes, one of the first questions we, as the forensic examiner, are asked is "What data is on the device?". An active file listing comes in handy here. Creating a file listing of loose files is a very straightforward task in Python. However, this will be slightly more complicated because we are working with a forensic image rather than loose files. This recipe will be a cornerstone for future scripts as it will allow us to recursively access and process every file in the image. As we continue to introduce new concepts and features from the Sleuth Kit, we will add new functionality to our previous recipes in an iterative process. In a similar way, this recipe will become integral in future recipes to iterate through directories and process files.

Getting started

Refer to the Getting started section in the Opening Acquisitions recipe for information on the build environment and setup details for pytsk3 and pyewf. All other libraries used in this script are present in Python's standard library.

How to do it...

We perform the following steps in this recipe:

Import argparse, csv, datetime, os, pytsk3, pyewf, and sys;
Identify if the evidence container is a raw (DD) image or an EWF (E01) container;
Access the forensic image using pytsk3;
Recurse through all directories in each partition;
Store file metadata in a list;
And write the active file list to a CSV.

How it works...

This recipe's command-line handler takes three positional arguments: EVIDENCE_FILE, TYPE, and OUTPUT_CSV, which represent the path to the evidence file, the type of evidence file, and the output CSV file, respectively. Similar to the previous recipe, the optional p switch can be supplied to specify a partition type. We use the os.path.dirname() method to extract the desired output directory path for the CSV file and, with the os.makedirs() function, create the necessary output directories if they do not exist.
if __name__ == '__main__': # Command-line Argument Parser parser = argparse.ArgumentParser() parser.add_argument("EVIDENCE_FILE", help="Evidence file path") parser.add_argument("TYPE", help="Type of Evidence", choices=("raw", "ewf")) parser.add_argument("OUTPUT_CSV", help="Output CSV with lookup results") parser.add_argument("-p", help="Partition Type", choices=("DOS", "GPT", "MAC", "SUN")) args = parser.parse_args() directory = os.path.dirname(args.OUTPUT_CSV) if not os.path.exists(directory) and directory != "": os.makedirs(directory) Once we have validated the input evidence file by checking that it exists and is a file, the four arguments are passed to the main() function. If there is an issue with initial validation of the input, an error is printed to the console before the script exits. if os.path.exists(args.EVIDENCE_FILE) and os.path.isfile(args.EVIDENCE_FILE): main(args.EVIDENCE_FILE, args.TYPE, args.OUTPUT_CSV, args.p) else: print("[-] Supplied input file {} does not exist or is not a file".format(args.EVIDENCE_FILE)) sys.exit(1) In the main() function, we instantiate the volume variable with None to avoid errors referencing it later in the script. After printing a status message to the console, we check if the evidence type is an E01 to properly process it and create a valid pyewf handle as demonstrated in more detail in the Opening Acquisitions recipe. Refer to that recipe for more details as this part of the function is identical. The end result is the creation of the pytsk3 handle, img_info, for the user supplied evidence file. def main(image, img_type, output, part_type): volume = None print "[+] Opening {}".format(image) if img_type == "ewf": try: filenames = pyewf.glob(image) except IOError, e: print "[-] Invalid EWF format:n {}".format(e) sys.exit(2) ewf_handle = pyewf.handle() ewf_handle.open(filenames) # Open PYTSK3 handle on EWF Image img_info = ewf_Img_Info(ewf_handle) else: img_info = pytsk3.Img_Info(image) Next, we attempt to access the volume of the image using the pytsk3.Volume_Info() method by supplying it the image handle. If the partition type argument was supplied, we add its attribute ID as the second argument. If we receive an IOError when attempting to access the volume, we catch the exception as e and print it to the console. Notice, however, that we do not exit the script as we often do when we receive an error. We'll explain why in the next function. Ultimately, we pass the volume, img_info, and output variables to the openFS() method. try: if part_type is not None: attr_id = getattr(pytsk3, "TSK_VS_TYPE_" + part_type) volume = pytsk3.Volume_Info(img_info, attr_id) else: volume = pytsk3.Volume_Info(img_info) except IOError, e: print "[-] Unable to read partition table:n {}".format(e) openFS(volume, img_info, output) The openFS() method tries to access the filesystem of the container in two ways. If the volume variable is not None, it iterates through each partition, and if that partition meets certain criteria, attempts to open it. If, however, the volume variable is None, it instead tries to directly call the pytsk3.FS_Info() method on the image handle, img. As we saw, this latter method will work and give us filesystem access for logical images whereas the former works for physical images. Let's look at the differences between these two methods. Regardless of the method, we create a recursed_data list to hold our active file metadata. 
In the first instance, where we have a physical image, we iterate through each partition and check that is it greater than 2,048 sectors and does not contain the words "Unallocated", "Extended", or "Primary Table" in its description. For partitions meeting these criteria, we attempt to access its filesystem using the FS_Info() function by supplying the pytsk3 img object and the offset of the partition in bytes. If we are able to access the filesystem, we use to open_dir() method to get the root directory and pass that, along with the partition address ID, the filesystem object, two empty lists, and an empty string, to the recurseFiles() method. These empty lists and string will come into play in recursive calls to this function as we will see shortly. Once the recurseFiles() method returns, we append the active file metadata to the recursed_data list. We repeat this process for each partition def openFS(vol, img, output): print "[+] Recursing through files.." recursed_data = [] # Open FS and Recurse if vol is not None: for part in vol: if part.len > 2048 and "Unallocated" not in part.desc and "Extended" not in part.desc and "Primary Table" not in part.desc: try: fs = pytsk3.FS_Info(img, offset=part.start*vol.info.block_size) except IOError, e: print "[-] Unable to open FS:n {}".format(e) root = fs.open_dir(path="/") data = recurseFiles(part.addr, fs, root, [], [], [""]) recursed_data.append(data) We employ a similar method for the second instance, where we have a logical image, where the volume is None. In this case, we attempt to directly access the filesystem and, if successful, we pass that to the recurseFiles() method and append the returned data to our recursed_data list. Once we have our active file list, we send it and the user supplied output file path to the csvWriter() method. Let's dive into the recurseFiles() method which is the meat of this recipe. else: try: fs = pytsk3.FS_Info(img) except IOError, e: print "[-] Unable to open FS:n {}".format(e) root = fs.open_dir(path="/") data = recurseFiles(1, fs, root, [], [], [""]) recursed_data.append(data) csvWriter(recursed_data, output) The recurseFiles() function is based on an example of the FLS tool (https://github.com/py4n6/pytsk/blob/master/examples/fls.py) and David Cowen's Automating DFIR series tool dfirwizard (https://github.com/dlcowen/dfirwizard/blob/master/dfirwiza rd-v9.py). To start this function, we append the root directory inode to the dirs list. This list is used later to avoid unending loops. Next, we begin to loop through each object in the root directory and check that it has certain attributes we would expect and that its name is not either "." or "..". def recurseFiles(part, fs, root_dir, dirs, data, parent): dirs.append(root_dir.info.fs_file.meta.addr) for fs_object in root_dir: # Skip ".", ".." or directory entries without a name. if not hasattr(fs_object, "info") or not hasattr(fs_object.info, "name") or not hasattr(fs_object.info.name, "name") or fs_object.info.name.name in [".", ".."]: continue If the object passes that test, we extract its name using the info.name.name attribute. Next, we use the parent variable, which was supplied as one of the function's inputs, to manually create the file path for this object. There is no built-in method or attribute to do this automatically for us. We then check if the file is a directory or not and set the f_type variable to the appropriate type. If the object is a file, and it has an extension, we extract it and store it in the file_ext variable. 
If we encounter an AttributeError when attempting to extract this data we continue onto the next object. try: file_name = fs_object.info.name.name file_path = "{}/{}".format("/".join(parent), fs_object.info.name.name) try: if fs_object.info.meta.type == pytsk3.TSK_FS_META_TYPE_DIR: f_type = "DIR" file_ext = "" else: f_type = "FILE" if "." in file_name: file_ext = file_name.rsplit(".")[-1].lower() else: file_ext = "" except AttributeError: continue We create variables for the object size and timestamps. However, notice that we pass the dates to a convertTime() method. This function exists to convert the UNIX timestamps into a human-readable format. With these attributes extracted, we append them to the data list using the partition address ID to ensure we keep track of which partition the object is from size = fs_object.info.meta.size create = convertTime(fs_object.info.meta.crtime) change = convertTime(fs_object.info.meta.ctime) modify = convertTime(fs_object.info.meta.mtime) data.append(["PARTITION {}".format(part), file_name, file_ext, f_type, create, change, modify, size, file_path]) If the object is a directory, we need to recurse through it to access all of its sub-directories and files. To accomplish this, we append the directory name to the parent list. Then, we create a directory object using the as_directory() method. We use the inode here, which is for all intents and purposes a unique number and check that the inode is not already in the dirs list. If that were the case, then we would not process this directory as it would have already been processed. If the directory needs to be processed, we call the recurseFiles() method on the new sub_directory and pass it current dirs, data, and parent variables. Once we have processed a given directory, we pop that directory from the parent list. Failing to do this will result in false file path details as all of the former directories will continue to be referenced in the path unless removed. Most of this function was under a large try-except block. We pass on any IOError exception generated during this process. Once we have iterated through all of the subdirectories, we return the data list to the openFS() function. if f_type == "DIR": parent.append(fs_object.info.name.name) sub_directory = fs_object.as_directory() inode = fs_object.info.meta.addr # This ensures that we don't recurse into a directory # above the current level and thus avoid circular loops. if inode not in dirs: recurseFiles(part, fs, sub_directory, dirs, data, parent) parent.pop(-1) except IOError: pass dirs.pop(-1) return data Let's briefly look at the convertTime() function. We've seen this type of function before, if the UNIX timestamp is not 0, we use the datetime.utcfromtimestamp() method to convert the timestamp into a human-readable format. def convertTime(ts): if str(ts) == "0": return "" return datetime.utcfromtimestamp(ts) With the active file listing data in hand, we are now ready to write it to a CSV file using the csvWriter() method. If we did find data (i.e., the list is not empty), we open the output CSV file, write the headers, and loop through each list in the data variable. We use the csvwriterows() method to write each nested list structure to the CSV file. 
def csvWriter(data, output):
    if data == []:
        print "[-] No output results to write"
        sys.exit(3)
    print "[+] Writing output to {}".format(output)
    with open(output, "wb") as csvfile:
        csv_writer = csv.writer(csvfile)
        headers = ["Partition", "File", "File Ext", "File Type", "Create Date", "Change Date", "Modify Date", "Size", "File Path"]
        csv_writer.writerow(headers)
        for result_list in data:
            csv_writer.writerows(result_list)

Note that the header order matches the order in which the values were appended in recurseFiles() (create, change, modify). When opened in a spreadsheet application, the output CSV shows the type of data this recipe extracts from forensic images.

There's more...
For this recipe, there are a number of improvements that could further increase its utility:
Use tqdm, or another library, to create a progress bar to inform the user of the current execution progress.
Learn about the additional metadata values that can be extracted from filesystem objects using pytsk3 and add them to the output CSV file.

Summary
In summary, we have learned how to use pytsk3 to recursively iterate through any filesystem supported by the Sleuth Kit. This forms the basis of how we can use the Sleuth Kit to programmatically process forensic acquisitions. With this recipe in place, we will be able to interact further with these files in future recipes.
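To see how these pieces might be driven end to end, here is a minimal sketch, assuming a raw (dd) image and assuming the functions from this recipe (openFS(), recurseFiles(), convertTime(), and csvWriter()) are already defined in the same script. The image path and output name are placeholders, and E01 containers would additionally require pyewf, which this sketch does not cover.

import pytsk3

image_path = "evidence.dd"       # hypothetical raw image path
output_csv = "file_listing.csv"  # hypothetical output file

img = pytsk3.Img_Info(image_path)    # open the raw image
try:
    vol = pytsk3.Volume_Info(img)    # physical image: read the partition table
except IOError:
    vol = None                       # logical image: no volume system present

openFS(vol, img, output_csv)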

article-image-administering-arcgis-enterprise-through-rest-administrative-directories
Chad Cooper
07 Mar 2018
8 min read
Save for later

Administering ARCGIS Enterprise through the REST administrative directories

Chad Cooper
07 Mar 2018
8 min read
This is a guest post written by Chad Cooper. Chad has worked in the geospatial industry over the last 15 years as a technician, analyst, and developer, pertaining to state and local government, oil and gas, and academia. He is also the author of the title Mastering ArcGIS Enterprise Administration, which aims to help you learn to install configure, secure, and fully utilize ArcGIS Enterprise system. ArcGIS Enterprise is one of the most widely used GIS packages in the world. With the 10.5 release, Portal for ArcGIS became a first-class citizen, living alongside ArcGIS Server and playing a major role in management and administration of the web GIS. Data Store for ArcGIS allows for local storage of hosted feature services and is also a major player in the ArcGIS Enterprise ecosystem. The ArcGIS Web Adaptor completes ArcGIS Enterprise and is the fourth major component. These components are new to most users (Portal and Data Store), and they come with an increased level of configuration, complexity and administration. Luckily, there are many ways to administer and manage the ArcGIS Enterprise system. In this article, we will look at a few of those methods. How to access the ArcGIS server REST administrator directory ArcGIS Server exposes its functionality through web services using REST. With this architecture comes the ArcGIS Server REST Application Programming Interface, or API, that, in addition to exposing ArcGIS Server services, exposes every administrative task that ArcGIS Server supports. In the API, ArcGIS Server administrative tasks are considered resources and are accessed through URLs (which are Uniform Resource Locators, after all). Operations act on these resources and update their information or state. Resources and their operations are hierarchical and standardized and have unique URLs. Like the web, the REST API is stateless, meaning that it does not retain information from one request to another by either the sender or receiver. Each request that is sent is expected to contain all the necessary information to process that request. If it does, the server processes the request and sends back a well-defined response. As it is accessed over the web, the ArcGIS Server REST API can also be invoked from any language that can make a web service call, such as Python. Accessing the ArcGIS Server Administrator Directory can be done in several ways, depending upon your Web Adaptor configuration. From the ArcGIS Server machine, the Server Administrator Directory can be accessed at https://localhost:6443/arcgis/admin. There is no shortcut to this URL in the Windows Start menu. From another machine on the internal network, the Server Administrator Directory can be accessed by using the fully qualified domain name, or FQDN, instead of localhost, such as https://server.domain.com:6443/arcgis/admin. If, during your Web Adaptor configuration, you chose to Enable administrative access to your site through the Web Adaptor, you also will be able to access the Server Administrator Directory through your Web Adaptor URL, such as https://www.masteringageadmin.com/arcgis/admin. As with Server Manager, you will login as the primary site administrator (PSA) designated during installation or with other administrator credentials. Prior to ArcGIS 10.1, server configuration was held in plain text configuration files in the configuration store. These files are no longer part of the ArcGIS Server architecture. The ArcGIS Server REST Administrator Directory now exposes these settings. 
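Because every administrative resource is just an HTTP endpoint, the Server Administrator Directory can also be scripted, for example from Python. The snippet below is only a sketch: the host name and credentials are placeholders, it disables certificate verification for brevity, and the generateToken parameters shown (client=requestip, f=json) follow the commonly documented form of that operation, so confirm them against your server's version of the API.

import requests

server_admin = "https://server.domain.com:6443/arcgis/admin"  # placeholder host

# Request an administrative token from the generateToken operation
payload = {
    "username": "siteadmin",   # placeholder credentials
    "password": "secret",
    "client": "requestip",
    "f": "json",
}
resp = requests.post(server_admin + "/generateToken", data=payload, verify=False)
token = resp.json()["token"]

# Use the token to read a resource, for example the services root
services = requests.get(server_admin + "/services",
                        params={"token": token, "f": "json"}, verify=False)
print(services.json())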
How to use the ArcGIS Server REST Administrator Directory
The ArcGIS Server REST Administrator Directory, or "REST Admin" as it will be referred to herein, is a powerful way to manage all aspects of ArcGIS Server administration, as it exposes every administrative task that ArcGIS Server supports. Remember from earlier that the API is organized into resources and operations. Resources are settings within ArcGIS Server, and operations act on those resources to update their information or change their well-defined state, usually through an HTTP GET or POST method. HTTP GET requests data from a resource, while HTTP POST submits data to be processed to a resource. In other words, GET retrieves data and POST inserts/updates data. An example of a resource is a service. An existing service can have a well-defined state of stopped or started; it must be one or the other. Operations available on the service resource in the REST API include Start Service, Stop Service, Edit Service, and Delete Service. The Start, Stop, and Delete operations change the state of the service (from stopped to started, from started to stopped, and from either state to deleted; technically, if the service is started, it is first stopped before it is deleted). The Edit Service operation changes the information in the resource. Resources can also have child resources, which can in turn have their own set of operations and child resources. Remember that the API is hierarchical, so, for example, a service resource has the child resource Item Information, which has the Edit Item Information operation. To get to this operation in the REST Admin, we would log in to the REST Admin and go to services | SampleWorldCities.MapServer | iteminfo | edit, which would resemble the following in URL form:

https://www.masteringageadmin.com/arcgis/admin/services/SampleWorldCities.MapServer/iteminfo/edit

In the REST Admin, we could now edit the service Description, Summary, Tags, and Thumbnail. By updating the Item Information in the above example and clicking the Update button, we would be sending an edit HTTP POST operation to the https://www.masteringageadmin.com/arcgis/admin/services/SampleWorldCities.MapServer/iteminfo resource. The ArcGIS Server Manager equivalent for this process would be to go to Services | Manage Services | the Edit Service pencil button to the right of the service name | Item Description. Hopefully this gives you a better understanding of how the REST API works and how actions carried out in Server Manager and Server are executed by the API on the backend.

Administering Portal for ArcGIS through the Portal REST administrative directory
Just like ArcGIS Server, Portal has a REST backend from which all administrative tasks can be performed. We previously covered how the web interface for ArcGIS Server is a frontend to the ArcGIS Server REST API, and Portal is no different. We also covered services and how REST calls are made to the API.
The Portal Administrative Directory, referred to herein as "Portal Admin", can be accessed from within the internal network (bypassing the Web Adaptor) at a URL such as:

https://<FQDN>:7443/arcgis/portaladmin/

If administrative access is enabled on the Portal Web Adaptor, then we can access Portal Admin outside of our internal network at the Web Adaptor URL, such as:

https://www.your-domain.com/portal/portaladmin/

To log in to Portal Admin as an administrator, enter the Username and Password of an account with administrator privileges at the Portal Administrative Directory Login page and click the Login button. Let's now look at one administrative action that can be performed in the Portal REST Admin.

Portal licensing
Information on current Portal licensing can be viewed by going to Home | System | Licenses. Here, information on the validity and expiration of licensing and on registered members can be viewed. The Import Entitlements operation allows for the import of entitlements for ArcGIS Pro and additional products such as Business Analyst or Insights. For ArcGIS Pro, the operation requires an entitlements file exported out of My Esri. Once the entitlements have been imported, licenses can be assigned to users within Portal. Entitlements can have parts that are effective immediately and parts that become effective on a certain date. These all get imported, with the effective parts available immediately and the non-effective parts placed into a queue that Portal will automatically apply once they become effective. To import entitlements for ArcGIS Pro, do the following:

1. Have your entitlements file ready.
2. In Portal Admin, go to Home | System | Licenses | Import Entitlements.
3. Choose your entitlements file under Choose File.
4. For Application, choose ArcGISPro.
5. For Format, choose JSON or HTML (this is only the response format).
6. Click Import.

Once the entitlements are imported, the licenses can be assigned to users in Portal under My Organization | Manage Licenses. At its latest release, ArcGIS Enterprise has more components than ever before, resulting in additional setup, configuration, administration, and management requirements. Here, we looked at several ways to access the ArcGIS Server and Portal for ArcGIS REST administrative interfaces. These are a few of the many methods available to interact with your ArcGIS Enterprise system. Check out Mastering ArcGIS Enterprise Administration to learn how to administer ArcGIS Server, Portal, and Data Store through user interfaces, the REST API, and Python scripts.
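Since the chapter closes by pointing at Python scripting, here is a hedged sketch of reading the licenses resource discussed above over HTTP. The URL mirrors the Home | System | Licenses path, the token acquisition (normally performed against the Portal sharing API) is stubbed out, and the exact endpoint path and response fields should be verified against your Portal version.

import requests

portal_admin = "https://www.your-domain.com/portal/portaladmin"  # placeholder URL
token = "<administrator token obtained from the Portal sharing API>"  # stubbed for brevity

# Read current licensing information (Home | System | Licenses)
resp = requests.get(portal_admin + "/system/licenses",
                    params={"token": token, "f": "json"}, verify=False)
print(resp.json())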

article-image-implementing-matrix-operations-using-scipy-numpy
Pravin Dhandre
07 Mar 2018
5 min read
Save for later

Implementing matrix operations using SciPy and NumPy

Pravin Dhandre
07 Mar 2018
5 min read
[box type="note" align="" class="" width=""]This article is an excerpt from a book co-authored by L. Felipe Martins, Ruben Oliva Ramos and V Kishore Ayyadevara titled SciPy Recipes. This book includes hands-on recipes for using different components of the SciPy Stack such as NumPy, SciPy, matplotlib, pandas, etc.[/box] In this article, we will discuss how to leverage the power of SciPy and NumPy to perform numerous matrix operations and solve common challenges faced while proceeding with statistical analysis. Matrix operations and functions on two-dimensional arrays Basic matrix operations form the backbone of quite a few statistical analyses—for example, neural networks. In this section, we will be covering some of the most used operations and functions on 2D arrays: Addition Multiplication by scalar Matrix arithmetic Matrix-matrix multiplication Matrix inversion Matrix transposition In the following sections, we will look into the methods of implementing each of them in Python using SciPy/NumPy. How to do it… Let's look at the different methods. Matrix addition In order to understand how matrix addition is done, we will first initialize two arrays: # Initializing an array x = np.array([[1, 1], [2, 2]]) y = np.array([[10, 10], [20, 20]]) Similar to what we saw in a previous chapter, we initialize a 2 x 2 array by using the np.array function. There are two methods by which we can add two arrays. Method 1 A simple addition of the two arrays x and y can be performed as follows: x+y Note that x evaluates to: [[1 1] [2 2]] y evaluates to: [[10 10] [20 20]] The result of x+y would be equal to: [[1+10 1+10] [2+20 2+20]] Finally, this gets evaluated to: [[11 11] [22 22]] Method 2 The same preceding operation can also be performed by using the add function in the numpy package as follows: np.add(x,y) Multiplication by a scalar Matrix multiplication by a scalar can be performed by multiplying the vector with a number. We will perform the same using the following two steps: Initialize a two-dimensional array. Multiply the two-dimensional array with a scalar. We perform the steps, as follows: To initialize a two-dimensional array: x = np.array([[1, 1], [2, 2]]) To multiply the two-dimensional array with the k scalar: k*x For example, if the scalar value k = 2, then the value of k*x translates to: 2*x array([[2, 2], [4, 4]]) Matrix arithmetic Standard arithmetic operators can be performed on top of NumPy arrays too. The operations used most often are: Addition Subtraction Multiplication Division Exponentials The other major arithmetic operations are similar to the addition operation we performed on two matrices in the Matrix addition section earlier: # subtraction x-y array([[ -9, -9], [-18, -18]]) # multiplication x*y array([[10, 10], [40, 40]]) While performing multiplication here, there is an element to element multiplication between the two matrices and not a matrix multiplication (more on matrix multiplication in the next section): # division x/y array([[ 0.1, 0.1], [ 0.1, 0.1]]) # exponential x**y array([[ 1, 1], [1048576, 1048576]], dtype=int32) Matrix-matrix multiplication Matrix to matrix multiplication works in the following way: We have a set of two matrices with the following shape: Matrix A has n rows and m columns and matrix B has m rows and p columns. 
The matrix multiplication of A and B is then the n x p matrix whose (i, j) entry is the sum over k of A[i, k] * B[k, j]. The matrix operation is performed by using the built-in dot function available in NumPy as follows.

Initialize the arrays:

x = np.array([[1, 1], [2, 2]])
y = np.array([[10, 10], [20, 20]])

Perform the matrix multiplication using the dot function in the numpy package:

np.dot(x, y)
array([[30, 30],
       [60, 60]])

The np.dot function does the multiplication in the following way:

array([[1*10 + 1*20, 1*10 + 1*20],
       [2*10 + 2*20, 2*10 + 2*20]])

Whenever matrix multiplication happens, the number of columns in the first matrix must equal the number of rows in the second matrix.

Matrix transposition
Matrix transposition is performed by using the transpose function available in the numpy package. The process to generate the transpose of a matrix is as follows.

Initialize a matrix:

A = np.array([[1, 2], [3, 4]])

Calculate the transpose of the matrix:

A.transpose()
array([[1, 3],
       [2, 4]])

The transpose of a matrix with m rows and n columns is a matrix with n rows and m columns.

Matrix inversion
While we performed most of the basic arithmetic operations on top of matrices earlier, we have not performed any specialist functions from scientific computing/analysis, for example, matrix inversion, transposition, ranking of a matrix, and so on. This is where the other functions available within the scipy package shine (over and above the previously discussed functions), in scenarios where more data manipulation is required apart from the standard operations.

Matrix inversion can be performed by using the function available in scipy.linalg. The process to perform matrix inversion and its implementation in Python is as follows.

Import the relevant classes/functions from the package:

from scipy import linalg

Initialize a matrix:

A = np.array([[1, 2], [3, 4]])

Pass the initialized matrix through the inverse function in the linalg package:

linalg.inv(A)
array([[-2. ,  1. ],
       [ 1.5, -0.5]])

We saw how easily we can implement all the basic matrix operations with Python's scientific library, SciPy. You may check out this book SciPy Recipes to perform advanced computing tasks like Discrete Fourier Transform and K-means with the SciPy stack.
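A quick way to tie these operations together is to verify, numerically, that a matrix multiplied by its inverse yields the identity matrix. This short check uses only the functions shown above.

import numpy as np
from scipy import linalg

A = np.array([[1, 2], [3, 4]])
A_inv = linalg.inv(A)

# A times its inverse should be (numerically) the 2 x 2 identity matrix
print(np.allclose(np.dot(A, A_inv), np.eye(2)))  # True

# Transposing twice returns the original matrix
print(np.array_equal(A.transpose().transpose(), A))  # True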
article-image-introduction-aspnet-core-web-api
Packt
07 Mar 2018
13 min read
Save for later

Introduction to ASP.NET Core Web API

Packt
07 Mar 2018
13 min read
In this article by MithunPattankarand MalendraHurbuns, the authors of the book, Mastering ASP.NET Web API,we will start with a quick recap of MVC. We will be looking at the following topics:  Quick recap of MVC framework  Why Web APIs were incepted and it's evolution?  Introduction to .NET Core?  Overview of ASP.NET Core architecture (For more resources related to this topic, see here.) Quick recap of MVC framework Model-View-Controller (MVC) is a powerful and elegant way of separating concerns within an application and applies itself extremely well to web applications. With ASP.NETMVC, it's translated roughly as follows: Models (M): These are the classes that represent the domain you are interested in. These domain objects often encapsulate data stored in a database as well as code that manipulates the data and enforces domain-specific business logic. With ASP.NETMVC, this is most likely a Data Access Layer of some kind, using a tool like Entity Framework or NHibernate or classic ADO.NET.  View (V): This is a template to dynamically generate HTML.  Controller(C): This is a special class that manages the relationship between the View and the Model. It responds to user input, talks to the Model, and decides which view to render (if any). In ASP.NETMVC, this class is conventionally denoted by the suffix Controller. Why Web APIs were incepted and it's evolution? Looking back to days when ASP.NETASMX-based XML web service was widely used for building service-oriented applications, it was easiest way to create SOAP-based service which can be used by both .NET applications and non .NET applications. It was available only over HTTP. Around 2006, Microsoft released Windows Communication Foundation (WCF).WCF was and even now a powerful technology for building SOA-based applications. It was giant leap in the world of Microsoft .NET world. WCF was flexible enough to be configured as HTTP service, Remoting service, TCP service, and so on. Using Contracts of WCF, we would keep entire business logic code base same and expose the service as HTTP based or non HTTP based via SOAP/ non SOAP. Until 2010 the ASMX based XML web service or WCF service were widely used in client server based applications, in-fact everything was running smoothly. But the developers of .NET or non .NET community started to feel need for completely new SOA technology for client server applications. Some of reasons behind them were as follows: With applications in production, the amount of data while communicating started to explode and transferring them over the network was bandwidth consuming. SOAP being light weight to some extent started to show signs of payload increase. A few KB SOAP packets were becoming few MBs of data transfer.  Consuming the SOAP service in applications lead to huge applications size because of WSDL and proxy generation. This was even worse when it was used in web applications. Any changes to SOAP services lead to repeat of consuming them by proxy generation. This wasn't easy task for any developers.  JavaScript-based web frameworks were getting released and gaining ground for much simpler way of web development. Consuming SOAP-based services were not that optimal way. Hand-held devices were becoming popular like tablets, smartphones. They had more focused applications and needed very lightweight service oriented approach.  Browser based Single Page Applications (SPA) was gaining ground very rapidly. Using SOAP based services for quite heavy for these SPA. 
Microsoft released REST based WCF components which can be configured to respond in JSON or XML, but even then it was WCF which was heavy technology to be used.  Applications where no longer just large enterprise services, but there was need was more focused light weight service to be up & running in few days and much easier to use. Any developer who has seen evolving nature of SOA based technologies like ASMX, WCF or any SOAP based felt the need to have much lighter, HTTP based services. HTTP only, JSON compatible POCO based lightweight services was need of the hour and concept of Web API started gaining momentum. What is Web API? A Web API is a programmatic interface to a system that is accessed via standard HTTP methods and headers. A Web API can be accessed by a variety of HTTP clients, including browsers and mobile devices. For Web API to be successful HTTP based service, it needed strong web infrastructure like hosting, caching, concurrency, logging, security etc. One of the best web infrastructure was none other than ASP.NET. ASP.NET either in form Web Form or MVC was widely adopted, so the solid base for web infrastructure was mature enough to be extended as Web API. Microsoft responded to community needs by creating ASP.NET Web API- a super-simple yet very powerful framework for building HTTP-only, JSON-by-default web services without all the fuss of WCF. ASP.NET Web API can be used to build REST based services in matter of minutes and can easily consumed with any front end technologies. It used IIS (mostly) for hosting, caching, concurrency etc. features, it became quite popular. It was launched in 2012 with most basics needs for HTTP based services like convention-based Routing, HTTP Request and Response messages. Later Microsoft released much bigger and better ASP.NET Web API 2 along with ASP.NETMVC 5 in Visual Studio 2013. ASP.NET Web API 2 evolved at much faster pace with these features. Installed via NuGet Installing of Web API 2 was made simpler by using NuGet, either create empty ASP.NET or MVC project and then run command in NuGet Package Manager Console: Install-Package Microsoft.AspNet.WebApi Attribute Routing Initial release of Web API was based on convention-based routing meaning we define one or more route templates and work around it. It's simple without much fuss as routing logic in a single place & it's applied across all controllers. The real world applications are more complicated with resources (controllers/ actions) have child resources like customers having orders, books having authors etc. In such cases convention-based routing is not scalable. Web API 2 introduced a new concept of Attribute Routing which uses attributes in programming languages to define routes. One straight forward advantage is developer has full controls how URIs for Web API are formed. Here is quick snippet of Attribute Routing: [Route("customers/{customerId}/orders")] public IEnumerable<Order>GetOrdersByCustomer(intcustomerId) { ... } For more understanding on this, read Attribute Routing in ASP.NET Web API 2(https://www.asp.net/web-api/overview/web-api-routing-and-actions/attribute-routing-in-web-api-2) OWIN self-host ASP.NET Web API lives on ASP.NET framework, leading to think that it can be hosted on IIS only. The Web API 2 came new hosting package. Microsoft.AspNet.WebApi.OwinSelfHost With this package it can self-hosted outside IIS using OWIN/Katana. 
CORS (Cross Origin Resource Sharing) Any Web API developed either using .NET or non .NET technologies and meant to be used across different web frameworks, then enabling CORS is must. A must read on CORS&ASP.NET Web API 2 (https://www.asp.net/web-api/overview/security/enabling-cross-origin-requests-in-web-api). IHTTPActionResult and Web API OData improvements are other few notable features which helped evolve Web API 2 as strong technology for developing HTTP based services. ASP.NET Web API 2 has becoming more powerful over the years with C# language improvements like Asynchronous programming using Async/ Await, LINQ, Entity Framework Integration, Dependency Injection with DI frameworks, and so on. ASP.NET into Open Source world Every technology has to evolve with growing needs and advancements in hardware, network and software industry, ASP.NET Web API is no exception to that. Some of the evolution that ASP.NET Web API should undergo from perspectives of developer community, enterprises and end users are: ASP.NETMVC and Web API even though part of ASP.NET stack but their implementation and code base is different. A unified code base reduces burden of maintaining them. It's known that Web API's are consumed by various clients like web applications, Native apps, and Hybrid apps, desktop applications using different technologies (.NET or non .NET). But how about developing Web API in cross platform way, need not rely always on Windows OS/ Visual Studio IDE. Open sourcing the ASP.NET stack so that it's adopted on much bigger scale. End users are benefitted with open source innovations. We saw that why Web APIs were incepted, how they evolved into powerful HTTP based service and some evolutions required. With these thoughts Microsoft made an entry into world of Open Source by launching .NET Core and ASP.NET Core 1.0. What is .NET Core? .NET Core is a cross-platform free and open-source managed software framework similar to .NET Framework. It consists of CoreCLR, a complete cross-platform runtime implementation of CLR. .NET Core 1.0 was released on 27 June 2016 along with Visual Studio 2015 Update 3, which enables .NET Core development. In much simpler terms .NET Core applications can be developed, tested, deployed on cross platforms such as Windows, Linux flavours, macOS systems. With help of .NET Core, we don't really need Windows OS and in particular Visual Studio IDE to develop ASP.NET web applications, command-line apps, libraries, and UWP apps. In short, let's understand .NET Core components: CoreCLR:It is a virtual machine that manages the execution of .NET programs. CoreCLRmeans Core Common Language Runtime, it includes the garbage collector, JIT compiler, base .NET data types and many low-level classes. CoreFX: .NET Core foundational libraries likes class for collections, file systems, console, XML, Async and many others. CoreRT: .NET Core runtime optimized for AOT (ahead of time compilation) scenarios, with the accompanying .NET Native compiler toolchain. Its main responsibility is to do native compilation of code written in any of our favorite .NET programming language. .NET Core shares subset of original .NET framework, plus it comes with its own set of APIs that is not part of .NET framework. This results in some shared APIs that can be used by both .NET core and .NET framework. A .Net Core application can easily work on existing .NET Framework but not vice versa. 
.NET Core provides a CLI (Command Line Interface) for an execution entry point for operating systems and provides developer services like compilation and package management. The following are the .NET Core interesting points to know: .NET Core can be installed on cross platforms like Windows, Linux, andmacOS. It can be used in device, cloud, and embedded/IoT scenarios.  Visual Studio IDE is not mandatory to work with .NET Core, but when working on Windows OS we can leverage existing IDE knowledge to work.  .NET Core is modular, meaning that instead of assemblies, developers deal with NuGet packages.  .NET Core relies on its package manager to receive updates because cross platform technology can't rely on Windows Updates. To learn .NET Core, we just need a shell, text editor and its runtime installed. .NET Core comes with flexible deployment. It can be included in your app or installed side-by-side user- or machine-wide.  .NET Core apps can also be self-hosted/run as standalone apps. .NET Core supports four cross-platform scenarios--ASP.NET Core web apps, command-line apps, libraries, and Universal Windows Platform apps. It does not implement Windows Forms or WPF which render the standard GUI for desktop software on Windows. At present only C# programming language can be used to write .NET Core apps. F# and VB support are on the way. We will primarily focus on ASP.NET Core web apps which includes MVC and Web API. CLI apps, libraries will be covered briefly. What is ASP.NET Core? A new open-source and cross-platform framework for building modern cloud-based web applications using .NET. ASP.NET Core is completely open-source, you can download it from GitHub. It's cross platform meaning you can develop ASP.NET Core apps on Linux/macOS and of course on Windows OS. ASP.NET was first released almost 15 years back with .NET framework. Since then it's adopted by millions of developers for large, small applications. ASP.NET has evolved with many capabilities. With .NET Core as cross platform, ASP.NET took a huge leap beyond boundaries of Windows OS environment for development and deployment of web applications. ASP.NET Core overview                                                ASP.NET Core Architecture overview ASP.NET Core high level overview provides following insights: ASP.NET Core runs both on Full .NET framework and .NET Core.  ASP.NET Core applications with full .NET framework can only be developed and deployed only Windows OS/Server.  When using .NET core, it can be developed and deployed on platform of choice. The logos of Windows, Linux, macOSindicates that you can work with ASP.NET Core.  ASP.NET Core when on non-Windows machine, use the .NET Core libraries to run the applications. It's obvious you won't have all full .NET libraries but most of them are available.  Developers working on ASP.NET Core can easily switch working on any machine not confined to Visual Studio 2015 IDE. ASP.NET Core can run with different version of .NET Core. ASP.NET Core has much more foundational improvements apart from being cross-platform, we gain following advantages of using ASP.NET Core: Totally Modular: ASP.NET Core takes totally modular approach for application development, every component needed to build application are well factored into NuGet packages. Only add required packages through NuGet to keep overall application lightweight.  ASP.NET Core is no longer based on System.Web.dll. 
Choose your editors and tools: Visual Studio IDE was used to develop ASP.NET applications on Windows OS box, now since we have moved beyond the Windows world. Then we will require IDE/editors/ Tools required for developingASP.NET applications on Linux/macOS. Microsoft developed powerful lightweight code editors for almost any type of web applications called as Visual Studio Code.  ASP.NET Core is such a framework that we don't need Visual Studio IDE/ code to develop applications. We can use code editors like Sublime, Vim also. To work with C# code in editors, installed and use OmniSharp plugin.  OmniSharp is a set of tooling, editor integrations and libraries that together create an ecosystem that allows you to have a great programming experience no matter what your editor and operating system of choice may be.  Integration with modern web frameworks: ASP.NET Core has powerful, seamless integration with modern web frameworks like Angular, Ember, NodeJS, and Bootstrap.  Using bower andNPM, we can work with modern web frameworks.  Cloud ready: ASP.NET Core apps are cloud ready with configuration system, it just seamlessly gets transitioned from on-premises to cloud.  Built in Dependency Injection. Can be hosted on IIS or self-host in your own process or on nginx.  New light-weight and modular HTTP request pipeline. Unified code base for Web UI and Web APIs. We will see more on this when we explore anatomy of ASP.NET Core application. Summary So in this article we covered MVC framework and introduced .NET Core and its architecture. Resources for Article:   Further resources on this subject: [article] [article] [article]

article-image-logistic-regression-using-tensorflow
Packt
06 Mar 2018
9 min read
Save for later

Logistic Regression Using TensorFlow

Packt
06 Mar 2018
9 min read
In this article, by PKS Prakash and Achyutuni Sri Krishna Rao, authors of R Deep Learning Cookbook, we will learn how to perform logistic regression using TensorFlow. In this recipe, we will cover the application of TensorFlow in setting up a logistic regression model. The example will use a similar dataset to that used in the H2O model setup. (For more resources related to this topic, see here.)

What is TensorFlow
TensorFlow is another open source library developed by the Google Brain Team to build numerical computation models using data flow graphs. The core of TensorFlow was developed in C++ with the wrapper in Python. The tensorflow package in R gives you access to the TensorFlow API composed of Python modules to execute computation models. TensorFlow supports both CPU- and GPU-based computations. The tensorflow package in R calls the Python tensorflow API for execution, so it is essential to install the tensorflow package in both R and Python to make R work. The following are the dependencies for tensorflow:
Python 2.7 / 3.x
R (>3.2)
devtools package in R for installing TensorFlow from GitHub
TensorFlow in Python
pip

Getting ready
The code for this section was created on Linux but can be run on any operating system. To start modeling, load the tensorflow package in the environment. R loads the default TensorFlow environment variable and also the NumPy library from Python in the np variable:

library("tensorflow") # Load TensorFlow
np <- import("numpy") # Load numpy library

How to do it...
The data is imported using a standard function from R, as shown in the following code. The data is imported using read.csv and transformed into matrix format, followed by selecting the features used for modeling as defined in xFeatures and yFeatures. The next step in TensorFlow is to set up a graph to run optimization:

# Loading input and test data
xFeatures = c("Temperature", "Humidity", "Light", "CO2", "HumidityRatio")
yFeatures = "Occupancy"
occupancy_train <- as.matrix(read.csv("datatraining.txt", stringsAsFactors = T))
occupancy_test <- as.matrix(read.csv("datatest.txt", stringsAsFactors = T))

# subset features for modeling and transform to numeric values
occupancy_train <- apply(occupancy_train[, c(xFeatures, yFeatures)], 2, FUN=as.numeric)
occupancy_test <- apply(occupancy_test[, c(xFeatures, yFeatures)], 2, FUN=as.numeric)

# Data dimensions
nFeatures <- length(xFeatures)
nRow <- nrow(occupancy_train)

Before setting up the graph, let's reset the graph using the following command:

# Reset the graph
tf$reset_default_graph()

Additionally, let's start an interactive session, as it will allow us to execute variables without referring to a session object:

# Starting session as interactive session
sess <- tf$InteractiveSession()

Define the logistic regression model in TensorFlow:

# Setting-up Logistic regression graph
x <- tf$constant(unlist(occupancy_train[, xFeatures]), shape=c(nRow, nFeatures), dtype=np$float32)
W <- tf$Variable(tf$random_uniform(shape(nFeatures, 1L)))
b <- tf$Variable(tf$zeros(shape(1L)))
y <- tf$matmul(x, W) + b

The input feature x is defined as a constant, as it will be an input to the system. The weight W and bias b are defined as variables that will be optimized during the optimization process. The y is set up as a symbolic representation between x, W, and b. The weight W is initialized with a random uniform distribution and b is assigned the value zero.
The next step is to set up the cost function for logistic regression:

# Setting-up cost function and optimizer
y_ <- tf$constant(unlist(occupancy_train[, yFeatures]), dtype="float32", shape=c(nRow, 1L))
cross_entropy <- tf$reduce_mean(tf$nn$sigmoid_cross_entropy_with_logits(labels=y_, logits=y, name="cross_entropy"))
optimizer <- tf$train$GradientDescentOptimizer(0.15)$minimize(cross_entropy)

# Start a session
init <- tf$global_variables_initializer()
sess$run(init)

Execute the gradient descent algorithm for the optimization of weights, using cross entropy as the loss function:

# Running optimization
for (step in 1:5000) {
  sess$run(optimizer)
  if (step %% 20 == 0)
    cat(step, "-", sess$run(W), sess$run(b), "==>", sess$run(cross_entropy), "\n")
}

How it works...
The performance of the model can be evaluated using AUC:

# Performance on Train
library(pROC)
ypred <- sess$run(tf$nn$sigmoid(tf$matmul(x, W) + b))
roc_obj <- roc(occupancy_train[, yFeatures], as.numeric(ypred))

# Performance on test
nRowt <- nrow(occupancy_test)
xt <- tf$constant(unlist(occupancy_test[, xFeatures]), shape=c(nRowt, nFeatures), dtype=np$float32)
ypredt <- sess$run(tf$nn$sigmoid(tf$matmul(xt, W) + b))
roc_objt <- roc(occupancy_test[, yFeatures], as.numeric(ypredt))

AUC can be visualized using the plot.roc function from the pROC package. The performance for training and testing (holdout) is very similar.

plot.roc(roc_obj, col = "green", lty=2, lwd=2)
plot.roc(roc_objt, add=T, col="red", lty=4, lwd=2)

Performance of logistic regression using TensorFlow

Visualizing TensorFlow graphs
TensorFlow graphs can be visualized using TensorBoard. It is a service that utilizes TensorFlow event files to visualize TensorFlow models as graphs. Graph model visualization in TensorBoard is also used to debug TensorFlow models.

Getting ready
TensorBoard can be started using the following command in the terminal:

$ tensorboard --logdir home/log --port 6006

The following are the major parameters for TensorBoard:
--logdir: To map to the directory to load TensorFlow events
--debug: To increase log verbosity
--host: To define the host to listen to; localhost (127.0.0.1) by default
--port: To define the port on which TensorBoard will serve

The preceding command will launch the TensorBoard service on localhost at port 6006. The tabs in TensorBoard capture relevant data generated during graph execution.

How to do it...
This section covers how to visualize TensorFlow models and output in TensorBoard. To visualize summaries and graphs, data from TensorFlow can be exported using the FileWriter command from the summary module.
A default session graph can be added using the following command:

# Create Writer Obj for log
log_writer = tf$summary$FileWriter('c:/log', sess$graph)

The graph for the logistic regression developed using the preceding code is shown in the following figure:

Visualization of the logistic regression graph in TensorBoard

Similarly, other variable summaries can be added to TensorBoard using summary operations, as shown in the following code:

# Adding histogram summary to weight and bias variable
w_hist = tf$histogram_summary("weights", W)
b_hist = tf$histogram_summary("biases", b)

Create a cross entropy evaluation for test. An example script to generate the cross entropy cost function for test and train is shown in the following command:

# Set-up cross entropy for test
nRowt <- nrow(occupancy_test)
xt <- tf$constant(unlist(occupancy_test[, xFeatures]), shape=c(nRowt, nFeatures), dtype=np$float32)
ypredt <- tf$nn$sigmoid(tf$matmul(xt, W) + b)
yt_ <- tf$constant(unlist(occupancy_test[, yFeatures]), dtype="float32", shape=c(nRowt, 1L))
cross_entropy_tst <- tf$reduce_mean(tf$nn$sigmoid_cross_entropy_with_logits(labels=yt_, logits=ypredt, name="cross_entropy_tst"))

Add summary variables to be collected:

# Add summary ops to collect data
w_hist = tf$summary$histogram("weights", W)
b_hist = tf$summary$histogram("biases", b)
crossEntropySummary <- tf$summary$scalar("costFunction", cross_entropy)
crossEntropyTstSummary <- tf$summary$scalar("costFunction_test", cross_entropy_tst)

Open the writer object, log_writer. It writes the default graph to the location c:/log:

# Create Writer Obj for log
log_writer = tf$summary$FileWriter('c:/log', sess$graph)

Run the optimization and collect the summaries:

for (step in 1:2500) {
  sess$run(optimizer)
  # Evaluate performance on training and test data after every 50 iterations
  if (step %% 50 == 0){
    ### Performance on Train
    ypred <- sess$run(tf$nn$sigmoid(tf$matmul(x, W) + b))
    roc_obj <- roc(occupancy_train[, yFeatures], as.numeric(ypred))
    ### Performance on Test
    ypredt <- sess$run(tf$nn$sigmoid(tf$matmul(xt, W) + b))
    roc_objt <- roc(occupancy_test[, yFeatures], as.numeric(ypredt))
    cat("train AUC: ", auc(roc_obj), " Test AUC: ", auc(roc_objt), "\n")
    # Save summary of Bias and weights
    log_writer$add_summary(sess$run(b_hist), global_step=step)
    log_writer$add_summary(sess$run(w_hist), global_step=step)
    log_writer$add_summary(sess$run(crossEntropySummary), global_step=step)
    log_writer$add_summary(sess$run(crossEntropyTstSummary), global_step=step)
  }
}

Collect all the summaries into a single tensor using the merge_all command from the summary module:

summary = tf$summary$merge_all()

Write the summaries to the log file using the log_writer object:

log_writer = tf$summary$FileWriter('c:/log', sess$graph)
summary_str = sess$run(summary)
log_writer$add_summary(summary_str, step)
log_writer$close()

Summary
In this article, we have learned how to perform logistic regression using TensorFlow and covered the application of TensorFlow in setting up a logistic regression model.
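Because the R package is only a wrapper over the Python API, the same logistic regression graph can be written directly in Python. The sketch below assumes TensorFlow 1.x (matching the API used in this recipe) and uses synthetic data in place of the occupancy dataset, so treat it as an illustration of the mapping between the R and Python calls rather than a drop-in replacement.

import numpy as np
import tensorflow as tf  # assumes TensorFlow 1.x

# Synthetic stand-in for the occupancy features and labels
X_train = np.random.rand(100, 5).astype(np.float32)
y_train = (X_train[:, 0] > 0.5).astype(np.float32).reshape(-1, 1)

x = tf.constant(X_train, dtype=tf.float32)
y_ = tf.constant(y_train, dtype=tf.float32)

W = tf.Variable(tf.random_uniform([5, 1]))
b = tf.Variable(tf.zeros([1]))
y = tf.matmul(x, W) + b

cross_entropy = tf.reduce_mean(
    tf.nn.sigmoid_cross_entropy_with_logits(labels=y_, logits=y))
optimizer = tf.train.GradientDescentOptimizer(0.15).minimize(cross_entropy)

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    for step in range(5000):
        sess.run(optimizer)
    print(sess.run(cross_entropy))  # final training loss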

article-image-implement-long-short-term-memory-lstm-tensorflow
Gebin George
06 Mar 2018
4 min read
Save for later

Implement Long-short Term Memory (LSTM) with TensorFlow

Gebin George
06 Mar 2018
4 min read
[box type="note" align="" class="" width=""]This article is an excerpt from the book, Deep Learning Essentials written by Wei Di, Anurag Bhardwaj, and Jianing Wei. This book will help you get started with the essentials of deep learning and neural network modeling.[/box] In today’s tutorial, we will look at an example of using LSTM in TensorFlow to perform sentiment classification. The input to LSTM will be a sentence or sequence of words. The output of LSTM will be a binary value indicating a positive sentiment with 1 and a negative sentiment with 0. We will use a many-to-one LSTM architecture for this problem since it maps multiple inputs onto a single output. Figure LSTM: Basic cell architecture shows this architecture in more detail. As shown here, the input takes a sequence of word tokens (in this case, a sequence of three words). Each word token is input at a new time step and is input to the hidden state for the corresponding time step. For example, the word Book is input at time step t and is fed to the hidden state ht: Sentiment analysis: To implement this model in TensorFlow, we need to first define a few variables as follows: batch_size = 4 lstm_units = 16 num_classes = 2 max_sequence_length = 4 embedding_dimension = 64 num_iterations = 1000 As shown previously, batch_size dictates how many sequences of tokens we can input in one batch for training. lstm_units represents the total number of LSTM cells in the network. max_sequence_length represents the maximum possible length of a given sequence. Once defined, we now proceed to initialize TensorFlow-specific data structures for input data as follows: import tensorflow as tf labels = tf.placeholder(tf.float32, [batch_size, num_classes]) raw_data = tf.placeholder(tf.int32, [batch_size, max_sequence_length]) Given we are working with word tokens, we would like to represent them using a good feature representation technique. Let us assume the word embedding representation takes a word token and projects it onto an embedding space of dimension, embedding_dimension. The two-dimensional input data containing raw word tokens is now transformed into a three-dimensional word tensor with the added dimension representing the word embedding. We also use pre-computed word embedding, stored in a word_vectors data structure. We initialize the data structures as follows: data = tf.Variable(tf.zeros([batch_size, max_sequence_length, embedding_dimension]),dtype=tf.float32) data = tf.nn.embedding_lookup(word_vectors,raw_data) Now that the input data is ready, we look at defining the LSTM model. As shown previously, we need to create lstm_units of a basic LSTM cell. Since we need to perform a classification at the end, we wrap the LSTM unit with a dropout wrapper. To perform a full temporal pass of the data on the defined network, we unroll the LSTM using a dynamic_rnn routine of TensorFlow. 
We also initialize a random weight matrix and a constant value of 0.1 as the bias vector, as follows:

weight = tf.Variable(tf.truncated_normal([lstm_units, num_classes]))
bias = tf.Variable(tf.constant(0.1, shape=[num_classes]))
lstm_cell = tf.contrib.rnn.BasicLSTMCell(lstm_units)
wrapped_lstm_cell = tf.contrib.rnn.DropoutWrapper(cell=lstm_cell, output_keep_prob=0.8)
output, state = tf.nn.dynamic_rnn(wrapped_lstm_cell, data, dtype=tf.float32)

Once the output is generated by the dynamically unrolled RNN, we transpose its shape, multiply it by the weight vector, and add a bias vector to it to compute the final prediction value:

output = tf.transpose(output, [1, 0, 2])
last = tf.gather(output, int(output.get_shape()[0]) - 1)
prediction = (tf.matmul(last, weight) + bias)
weight = tf.cast(weight, tf.float64)
last = tf.cast(last, tf.float64)
bias = tf.cast(bias, tf.float64)

Since the initial prediction needs to be refined, we define an objective function with cross-entropy to minimize the loss as follows:

loss = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(logits=prediction, labels=labels))
optimizer = tf.train.AdamOptimizer().minimize(loss)

After this sequence of steps, we have a trained, end-to-end LSTM network for sentiment classification of arbitrary-length sentences. To summarize, we saw how effectively we can implement an LSTM network using TensorFlow. If you are interested to know more, check out the book Deep Learning Essentials, which will help you take your first steps in training efficient deep learning models and applying them in various practical scenarios.
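The excerpt stops at the graph definition, so for completeness here is a hedged sketch of the training loop that would drive it. The get_next_batch() helper and the word_vectors matrix are assumptions standing in for whatever batching and embedding-loading code a real project would provide.

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    for i in range(num_iterations):
        # get_next_batch() is a hypothetical helper returning token ids and one-hot labels
        batch_data, batch_labels = get_next_batch(batch_size)
        _, batch_loss = sess.run([optimizer, loss],
                                 feed_dict={raw_data: batch_data, labels: batch_labels})
        if i % 100 == 0:
            print("iteration {}: loss {:.4f}".format(i, batch_loss))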
article-image-gather-intel-and-plan-attack-strategies
Packt
06 Mar 2018
2 min read
Save for later

Gather Intel and Plan Attack Strategies

Packt
06 Mar 2018
2 min read
In this article by Himanshu Sharma, author of the Kali Linux - An Ethical Hacker's Cookbook, we will cover the following recipes:
Getting a list of subdomains
Shodan Honeyscore
Shodan plugins
Using Nmap to find open ports
(For more resources related to this topic, see here.)

In this article, we'll dive a little deeper and look at other tools available for gathering intel on our target. We'll start by using some of the well-known tools of Kali Linux, such as Fierce. Gathering information is a very crucial stage of performing a penetration test, as every step we take after this will be an outcome of all the information we gather during this stage. So it is very important that we gather as much information as possible before jumping into the exploitation stage.

Getting a list of subdomains
We don't always have a situation where a client has defined a full, detailed scope of what needs to be pentested. So we will use the following recipes to gather as much information as we can to perform a pentest.

How to do it…
We will see how to get a list of subdomains in the following ways.

Fierce
We'll start by jumping into Kali's terminal and using the first and most widely used tool, Fierce. To launch Fierce, type fierce -h to see the help menu. To perform a subdomain scan, we use this command:

fierce -dns host.com -threads 10

Dnsdumpster
Dnsdumpster is a free project by HackerTarget to look up subdomains. It relies on https://scans.io/ for its results. It is pretty simple to use: we type the domain name we want the subdomains for and it will show us the results.

Using Shodan for fun and profit
Shodan is the world's first search engine for devices connected to the Internet. It was launched in 2009 by John Matherly. Shodan can be used to look up webcams, databases, industrial systems, video games, and so on. Shodan mostly collects data on the most popular web services, such as HTTP, HTTPS, MongoDB, and FTP.

Getting ready
To use Shodan, we will need to create an account.

How to do it...
Open your browser and visit https://www.shodan.io. We begin by performing a simple search for FTP services running. To do this, we can use the following Shodan dork:

port:"21"

This search can be made more specific by specifying a particular country, organization, and so on:

port:21 country:"IN"

We can now see all the FTP servers running in India. We can also see the servers that allow anonymous login and the version of the FTP server they are running. Next, we'll try the organization filter by typing the following:

port:21 country:"IN" org:"BSNL"

Shodan has other tags as well, which can be used to perform advanced searches:
net: To scan IP ranges
city: To filter by city
More details can be found at https://www.shodan.io/explore.

Shodan Honeyscore
Shodan Honeyscore is another great project, built in Python. It helps us figure out whether an IP address we have is a honeypot or a real system.

How to do it...
To use Shodan Honeyscore, visit https://honeyscore.shodan.io/. Enter the IP address you want to check, and that's it!

Shodan plugins
To make our lives even easier, Shodan has plugins for Chrome and Firefox that can be used to check for open ports on websites we visit, on the go!

How to do it...
Download and install the plugin from https://www.shodan.io/. Browse any website, and you will see that by clicking on the plugin, you can see the open ports.

Using Nmap to find open ports
Nmap, or Network Mapper, is a security scanner written by Gordon Lyon.
It is used to find hosts and services in a network. It first came out in September 1997. Nmap has various features as well as scripts to perform various tests, such as finding the OS and service versions, and it can even be used to brute force default logins. Some of the most common types of scan are as follows:
TCP connect() scan
SYN stealth scan
UDP scan
Ping scan
Idle scan

How to do it...
Nmap comes pre-installed in Kali Linux. We can type the following command to start it and see all the options available:

nmap -h

To perform a basic scan, use the following command:

nmap -sV -Pn x.x.x.x

Here, -Pn implies that we do not check whether the host is up by performing a ping request first. The -sV parameter lists all the services running on the open ports that are found. Another flag we can use is -A, which automatically performs OS detection, version detection, script scanning, and traceroute. The command is as follows:

nmap -A -Pn x.x.x.x

To scan an IP range or multiple IPs, we can use this command:

nmap -A -Pn x.x.x.0/24

Using scripts
NSE, or the Nmap Scripting Engine, allows users to create their own scripts to perform different tasks automatically. These scripts are executed alongside a scan when it is run. They can be used to perform more effective version detection, exploitation of a vulnerability, and so on. The command for using a script is this:

nmap -Pn -sV host.com --script dns-brute

Here, the dns-brute script tries to fetch available subdomains by brute forcing them against a set of common subdomain names.

See also
More information on the scripts can be found in the official NSE documentation at https://nmap.org/nsedoc/

Summary
In this article, we learned how to get a list of subdomains on the network. Then we learned how to tell whether a system is a honeypot by calculating its Shodan Honeyscore; Chrome and Firefox plugins allow you to check open ports from your browser itself. Finally, we looked at how to use Nmap to find open ports.

Resources for Article:
Further resources on this subject:
Wireless Attacks in Kali Linux [article]
Introduction to Penetration Testing and Kali Linux [article]
What is Kali Linux [article]
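The Shodan searches shown earlier can also be scripted with the official shodan Python package (pip install shodan). The sketch below is an illustration only: the API key is a placeholder, and the result fields used (total, matches, ip_str, port) follow the library's documented response format, so check them against the version you install.

import shodan

API_KEY = "YOUR_API_KEY"  # placeholder taken from your Shodan account page
api = shodan.Shodan(API_KEY)

# Same dork as above: FTP servers in India
results = api.search('port:21 country:"IN"')
print("Total results: {}".format(results["total"]))

for match in results["matches"][:10]:
    print("{}:{}".format(match["ip_str"], match["port"]))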

article-image-how-to-compute-interpolation-in-scipy
Pravin Dhandre
05 Mar 2018
8 min read
Save for later

How to Compute Interpolation in SciPy

Pravin Dhandre
05 Mar 2018
8 min read
[box type="note" align="" class="" width=""]This article is an excerpt from a book co-authored by L. Felipe Martins, Ruben Oliva Ramos and V Kishore Ayyadevara titled SciPy Recipes. This book provides numerous recipes in mastering common tasks related to SciPy and associated libraries such as NumPy, pandas, and matplotlib.[/box] In today’s tutorial, we will see how to compute and solve polynomial, univariate interpolations using SciPy with detailed process and instructions. In this recipe, we will look at how to compute data polynomial interpolation by applying some important methods which are discussed in detail in the coming How to do it... section. Getting ready We will need to follow some instructions and install the prerequisites. How to do it… Let's get started. In the following steps, we will explain how to compute a polynomial interpolation and the things we need to know: They require the following parameters: points: An ndarray of floats, shape (n, D) data point coordinates. It can be either an array of shape (n, D) or a tuple of ndim arrays. values: An ndarray of float or complex shape (n,) data values. xi: A 2D ndarray of float or tuple of 1D array, shape (M, D). Points at which to interpolate data. method: A {'linear', 'nearest', 'cubic'}—This is an optional method of interpolation. One of the nearest return value is at the data point closest to the point of interpolation. See NearestNDInterpolator for more details. linear tessellates the input point set to n-dimensional simplices, and interpolates linearly on each simplex. See LinearNDInterpolator for more details. cubic (1D): Returns the value determined from a cubic spline. cubic (2D): Returns the value determined from a piecewise cubic, continuously differentiable (C1), and approximately curvature-minimizing polynomial surface. See CloughTocher2DInterpolator for more details. fill_value: float; optional. It is the value used to fill in for requested points outside of the convex hull of the input points. If it is not provided, then the default is nan. This option has no effect on the nearest method. rescale: bool; optional. Rescale points to the unit cube before performing interpolation. This is useful if some of the input dimensions have non-commensurable units and differ by many orders of magnitude. How it works… One can see that the exact result is reproduced by all of the methods to some degree, but for this smooth function, the piecewise cubic interpolant gives the best results: import matplotlib.pyplot as plt import numpy as np methods = [None, 'none', 'nearest', 'bilinear', 'bicubic', 'spline16', 'spline36', 'hanning', 'hamming', 'hermite', 'kaiser', 'quadric', 'catrom', 'gaussian', 'bessel', 'mitchell', 'sinc', 'lanczos'] # Fixing random state for reproducibility np.random.seed(19680801) grid = np.random.rand(4, 4) fig, axes = plt.subplots(3, 6, figsize=(12, 6), subplot_kw={'xticks': [], 'yticks': []}) fig.subplots_adjust(hspace=0.3, wspace=0.05) for ax, interp_method in zip(axes.flat, methods): ax.imshow(grid, interpolation=interp_method, cmap='viridis') ax.set_title(interp_method) plt.show() This is the result of the execution: Univariate interpolation In the next section, we will look at how to solve univariate interpolation. Getting ready We will need to follow some instructions and install the prerequisites. 
How to do it…
The following table summarizes the different univariate interpolation modes coded in SciPy, together with the processes that we may use to resolve them.

Finding a cubic spline that interpolates a set of data
In this recipe, we will look at how to find a cubic spline that interpolates a set of data using the CubicSpline method.

Getting ready
We will need to follow some instructions and install the prerequisites.

How to do it…
We can use the CubicSpline function with the following parameters:
x: array_like, shape (n,). A 1D array containing values of the independent variable. The values must be real, finite, and in strictly increasing order.
y: array_like. An array containing values of the dependent variable. It can have an arbitrary number of dimensions, but the length along axis must match the length of x. The values must be finite.
axis: int; optional. The axis along which y is assumed to be varying, meaning that for x[i], the corresponding values are np.take(y, i, axis=axis). The default is 0.
bc_type: string or two-tuple; optional. Boundary condition type. Two additional equations, given by the boundary conditions, are required to determine all coefficients of the polynomials on each segment. Refer to https://docs.scipy.org/doc/scipy-0.19.1/reference/generated/scipy.interpolate.CubicSpline.html#r59 for details. If bc_type is a string, the specified condition will be applied at both ends of the spline. The available conditions are:
not-a-knot (default): The first and second segments at a curve end are the same polynomial. This is a good default when there is no information about boundary conditions.
periodic: The interpolated function is assumed to be periodic with period x[-1] - x[0]. The first and last values of y must be identical: y[0] == y[-1]. This boundary condition results in y'[0] == y'[-1] and y''[0] == y''[-1].
clamped: The first derivatives at the curve ends are zero. Assuming a 1D y, bc_type=((1, 0.0), (1, 0.0)) is the same condition.
natural: The second derivatives at the curve ends are zero. Assuming a 1D y, bc_type=((2, 0.0), (2, 0.0)) is the same condition.
If bc_type is a two-tuple, the first and second values will be applied at the curve's start and end respectively. Each tuple value can be one of the previously mentioned strings (except periodic) or a tuple (order, deriv_values), allowing us to specify arbitrary derivatives at the curve ends:
order: The derivative order; it is 1 or 2.
deriv_value: An array_like containing derivative values. The shape must be the same as y, excluding the axis dimension. For example, if y is 1D, then deriv_value must be a scalar. If y is 3D with shape (n0, n1, n2) and axis=2, then deriv_value must be 2D with the shape (n0, n1).
extrapolate: {bool, 'periodic', None}; optional. If bool, it determines whether to extrapolate to out-of-bounds points based on the first and last intervals, or to return NaNs. If 'periodic', periodic extrapolation is used. If None (default), extrapolate is set to 'periodic' for bc_type='periodic' and to True otherwise.

How it works…
We have the following example:

%pylab inline
from scipy.interpolate import CubicSpline
import matplotlib.pyplot as plt

x = np.arange(10)
y = np.sin(x)
cs = CubicSpline(x, y)
xs = np.arange(-0.5, 9.6, 0.1)

plt.figure(figsize=(6.5, 4))
plt.plot(x, y, 'o', label='data')
plt.plot(xs, np.sin(xs), label='true')
plt.plot(xs, cs(xs), label="S")
plt.plot(xs, cs(xs, 1), label="S'")
plt.plot(xs, cs(xs, 2), label="S''")
plt.plot(xs, cs(xs, 3), label="S'''")
plt.xlim(-0.5, 9.5)
plt.legend(loc='lower left', ncol=2)
plt.show()

We can see the result here.

We see the next example:

theta = 2 * np.pi * np.linspace(0, 1, 5)
y = np.c_[np.cos(theta), np.sin(theta)]
cs = CubicSpline(theta, y, bc_type='periodic')
print("ds/dx={:.1f} ds/dy={:.1f}".format(cs(0, 1)[0], cs(0, 1)[1]))
# prints: ds/dx=0.0 ds/dy=1.0

xs = 2 * np.pi * np.linspace(0, 1, 100)
plt.figure(figsize=(6.5, 4))
plt.plot(y[:, 0], y[:, 1], 'o', label='data')
plt.plot(np.cos(xs), np.sin(xs), label='true')
plt.plot(cs(xs)[:, 0], cs(xs)[:, 1], label='spline')
plt.axes().set_aspect('equal')
plt.legend(loc='center')
plt.show()

In the resulting plot, we can see the final result.

Defining a B-spline for a given set of control points
In the next section, we will look at how to define a B-spline given a set of control data.

Getting ready
We need to follow some instructions and install the prerequisites.

How to do it…
A univariate spline in the B-spline basis can be written as

S(x) = Σ_{j=0}^{n-1} c_j B_{j,k;t}(x)

where the B_{j,k;t} are B-spline basis functions of degree k with knots t.

How it works...
Here, we construct a spline through a set of sampled points and compare it with a spline built on a reduced set of knots:

from scipy import interpolate
import numpy as np
import matplotlib.pyplot as plt

# sampling
x = np.linspace(0, 10, 10)
y = np.sin(x)

# spline through all the sampled points
tck = interpolate.splrep(x, y)
x2 = np.linspace(0, 10, 200)
y2 = interpolate.splev(x2, tck)

# spline with all the middle points as knots (not working yet)
# knots = x[1:-1]  # it should be something like this
knots = np.array([x[1]])  # not working with the above line; just seeing what this line does
weights = np.concatenate(([1], np.ones(x.shape[0] - 2) * .01, [1]))
tck = interpolate.splrep(x, y, t=knots, w=weights)
x3 = np.linspace(0, 10, 200)
y3 = interpolate.splev(x2, tck)

# plot
plt.plot(x, y, 'go', x2, y2, 'b', x3, y3, 'r')
plt.show()

Note that outside of the base interval, results differ. This is because BSpline extrapolates the first and last polynomial pieces of the B-spline functions active on the base interval. This is the result of solving the problem.

We successfully performed numerical computations and found interpolating functions using the polynomial and univariate interpolation routines coded in SciPy. If you found this tutorial useful, do check out the book SciPy Recipes to get quick recipes for performing other mathematical operations like differential equations, K-means and Discrete Fourier Transform.
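To connect the griddata parameter list at the start of this recipe to working code, here is a small self-contained example that interpolates scattered samples of a smooth function with each of the three methods; the test function and sample counts are arbitrary choices for illustration.

import numpy as np
from scipy.interpolate import griddata

def f(x, y):
    return np.sin(x) * np.cos(y)

# Scattered data point coordinates (n, D) and their values (n,)
rng = np.random.RandomState(0)
points = rng.rand(200, 2) * 4.0
values = f(points[:, 0], points[:, 1])

# Points at which to interpolate
grid_x, grid_y = np.mgrid[0:4:50j, 0:4:50j]

for method in ("nearest", "linear", "cubic"):
    zi = griddata(points, values, (grid_x, grid_y), method=method)
    # nanmax because linear/cubic return NaN outside the convex hull
    err = np.nanmax(np.abs(zi - f(grid_x, grid_y)))
    print(method, "max abs error:", err)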