How to perform Audio-Video-Image Scraping with Python

Amarabha Banerjee
08 Mar 2018
9 min read
[box type="note" align="" class="" width=""]Our article is an excerpt from the book Web Scraping with Python, written by Richard Lawson. This book contains step by step tutorials on how to leverage Python programming techniques for ethical web scraping. [/box] A common practice in scraping is the download, storage, and further processing of media content (non-web pages or data files). This media can include images, audio, and video. To store the content locally (or in a service like S3) and to do it correctly, we need to know what is the type of media, and it isn’t enough to trust the file extension in the URL. Hence, we will learn how to download and correctly represent the media type based on information from the web server. Another common task is the generation of thumbnails of images, videos, or even a page of a website. We will examine several techniques of how to generate thumbnails and make website page screenshots. Many times these are used on a new website as thumbnail links to the scraped media which is stored locally. Finally, it is often the need to be able to transcode media, such as converting non-MP4 videos to MP4, or changing the bit-rate or resolution of a video. Another scenario is to extract only the audio from a video file. We won't look at video transcoding, but we will rip MP3 audio out of an MP4 file using ffmpeg. It's a simple step from there to also transcode video with ffmpeg. Downloading media content from the web Downloading media content from the web is a simple process: use Requests or another library and download it just like you would HTML content. Getting ready There is a class named URLUtility in the urls.py module in the util folder of the solution. This class handles several of the scenarios in this chapter with downloading and parsing URLs. We will be using this class in this recipe and a few others. Make sure the modules folder is in your Python path. Also, the example for this recipe is in the 04/01_download_image.py file. How to do it Here is how we proceed with the recipe: The URLUtility class can download content from a URL. The code in the recipe's file is the following: import const from util.urls import URLUtility util = URLUtility(const.ApodEclipseImage()) print(len(util.data)) When running this you will see the following output:  Reading URL: https://apod.nasa.gov/apod/image/1709/BT5643s.jpg Read 171014 bytes 171014 The example reads 171014 bytes of data. How it works The URL is defined as a constant const.ApodEclipseImage() in the const module: def ApodEclipseImage(): return "https://apod.nasa.gov/apod/image/1709/BT5643s.jpg" The constructor of the URLUtility class has the following implementation: def __init__(self, url, readNow=True): """ Construct the object, parse the URL, and download now if specified""" self._url = url self._response = None self._parsed = urlparse(url) if readNow: self.read() The constructor stores the URL, parses it, and downloads the file with the read() method. The following is the code of the read() method: def read(self): self._response = urllib.request.urlopen(self._url) self._data = self._response.read() This function uses urlopen to get a response object, and then reads the stream and stores it as a property of the object. That data can then be retrieved using the data property: @property def data(self): self.ensure_response() return self._data The code then simply reports on the length of that data, with the value of 171014. 
There's more

This class will be used for other tasks, such as determining content types, filenames, and extensions for those files. We will examine parsing of URLs for filenames next.

Parsing a URL with urllib to get the filename

When downloading content from a URL, we often want to save it in a file, and often it is good enough to name that file using the filename found in the URL. But the URL consists of a number of fragments, so how can we find the actual filename within the URL, especially when there are often many parameters after the file name?

Getting ready

We will again be using the URLUtility class for this task. The code file for the recipe is 04/02_parse_url.py.

How to do it

Execute the recipe's file with your Python interpreter. It will run the following code:

util = URLUtility(const.ApodEclipseImage())
print(util.filename_without_ext)

This results in the following output:

Reading URL: https://apod.nasa.gov/apod/image/1709/BT5643s.jpg
Read 171014 bytes
The filename is: BT5643s

How it works

In the constructor for URLUtility, there is a call to urllib.parse.urlparse. The following demonstrates using the function interactively:

>>> parsed = urlparse(const.ApodEclipseImage())
>>> parsed
ParseResult(scheme='https', netloc='apod.nasa.gov', path='/apod/image/1709/BT5643s.jpg', params='', query='', fragment='')

The ParseResult object contains the various components of the URL. The path element contains the path and the filename. The call to the .filename_without_ext property returns just the filename without the extension:

@property
def filename_without_ext(self):
    filename = os.path.splitext(os.path.basename(self._parsed.path))[0]
    return filename

The call to os.path.basename returns only the filename portion of the path (including the extension). os.path.splitext() then separates the filename and the extension, and the property returns the first element of that tuple (the filename).

There's more

It may seem odd that this does not also return the extension as part of the filename. This is because we cannot assume that the content we received actually matches the type implied by the extension. It is more accurate to determine this using headers returned by the web server. That's our next recipe.

Determining the type of content for a URL

When performing a GET request for content from a web server, the web server will return a number of headers, one of which identifies the type of the content from the perspective of the web server. In this recipe we learn to use that header to determine what the web server considers the type of the content to be.

Getting ready

We again use the URLUtility class. The code for the recipe is in 04/03_determine_content_type_from_response.py.

How to do it

We proceed as follows. Execute the script for the recipe. It contains the following code:

util = URLUtility(const.ApodEclipseImage())
print("The content type is: " + util.contenttype)

With the following result:

Reading URL: https://apod.nasa.gov/apod/image/1709/BT5643s.jpg
Read 171014 bytes
The content type is: image/jpeg

How it works

The .contenttype property is implemented as follows:

@property
def contenttype(self):
    self.ensure_response()
    return self._response.headers['content-type']

The .headers property of the _response object is a dictionary-like class of headers. The content-type key will retrieve the content-type specified by the server. The call to the ensure_response() method simply ensures that the .read() function has been executed.

There's more

The headers in a response contain a wealth of information.
If we look more closely at the headers property of the response, we can see the following headers are returned:

>>> response = urllib.request.urlopen(const.ApodEclipseImage())
>>> for header in response.headers: print(header)
Date
Server
Last-Modified
ETag
Accept-Ranges
Content-Length
Connection
Content-Type
Strict-Transport-Security

And we can see the values for each of these headers:

>>> for header in response.headers: print(header + " ==> " + response.headers[header])
Date ==> Tue, 26 Sep 2017 19:31:41 GMT
Server ==> WebServer/1.0
Last-Modified ==> Thu, 31 Aug 2017 20:26:32 GMT
ETag ==> "547bb44-29c06-5581275ce2b86"
Accept-Ranges ==> bytes
Content-Length ==> 171014
Connection ==> close
Content-Type ==> image/jpeg
Strict-Transport-Security ==> max-age=31536000; includeSubDomains

Many of these we will not examine in this book, but for the unfamiliar it is good to know that they exist.

Determining the file extension from a content type

It is good practice to use the content-type header to determine the type of content, and to determine the extension to use when storing the content as a file.

Getting ready

We again use the URLUtility object that we created. The recipe's script is 04/04_determine_file_extension_from_contenttype.py.

How to do it

Proceed by running the recipe's script. An extension for the media type can be found using the .extension_from_contenttype property:

util = URLUtility(const.ApodEclipseImage())
print("Filename from content-type: " + util.extension_from_contenttype)
print("Filename from url: " + util.extension_from_url)

This results in the following output:

Reading URL: https://apod.nasa.gov/apod/image/1709/BT5643s.jpg
Read 171014 bytes
Filename from content-type: .jpg
Filename from url: .jpg

This reports both the extension determined from the content type and the extension determined from the URL. These can be different, but in this case they are the same.

How it works

The following is the implementation of the .extension_from_contenttype property:

@property
def extension_from_contenttype(self):
    self.ensure_response()
    map = const.ContentTypeToExtensions()
    if self.contenttype in map:
        return map[self.contenttype]
    return None

The first line ensures that we have read the response from the URL. The function then uses a Python dictionary, defined in the const module, which maps content types to extensions:

def ContentTypeToExtensions():
    return {
        "image/jpeg": ".jpg",
        "image/jpg": ".jpg",
        "image/png": ".png"
    }

If the content type is in the dictionary, then the corresponding value will be returned. Otherwise, None is returned. Note the corresponding property, .extension_from_url:

@property
def extension_from_url(self):
    ext = os.path.splitext(os.path.basename(self._parsed.path))[1]
    return ext

This uses the same technique as the .filename_without_ext property to parse the URL, but instead returns the [1] element, which represents the extension rather than the base filename.

To summarize, we discussed how effectively we can scrape audio, video, and image content from the web using Python. If you liked our post, be sure to check out Web Scraping with Python, which gives more information on performing web scraping efficiently with Python.
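The chapter introduction above also promises ripping MP3 audio out of an MP4 with ffmpeg; that recipe is not part of this excerpt, but a minimal sketch of the idea, assuming ffmpeg is installed on the PATH and a local file named video.mp4 exists, looks like this:

import subprocess

# Extract the audio track from an MP4 and encode it as MP3.
# -vn drops the video stream; -acodec libmp3lame selects the MP3 encoder.
subprocess.run(
    ["ffmpeg", "-i", "video.mp4", "-vn", "-acodec", "libmp3lame", "audio.mp3"],
    check=True)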
4 common challenges in Web Scraping and how to handle them

Amarabha Banerjee
08 Mar 2018
13 min read
[box type="note" align="" class="" width=""]Our article is an excerpt from the book Web Scraping with Python, written by Richard Lawson. This book contains step by step tutorials on how to leverage Python programming techniques for ethical web scraping. [/box] In this article, we will explore primary challenges of Web Scraping and how to get away with it easily. Developing a reliable scraper is never easy, there are so many what ifs that we need to take into account. What if the website goes down? What if the response returns unexpected data? What if your IP is throttled or blocked? What if authentication is required? While we can never predict and cover all what ifs, we will discuss some common traps, challenges, and workarounds. Note that several of the recipes require access to a website that I have provided as a Docker container. They require more logic than the simple, static site we used in earlier chapters. Therefore, you will need to pull and run a Docker container using the following Docker commands: docker pull mheydt/pywebscrapecookbook docker run -p 5001:5001 pywebscrapecookbook Retrying failed page downloads Failed page requests can be easily handled by Scrapy using retry middleware. When installed, Scrapy will attempt retries when receiving the following HTTP error codes: [500, 502, 503, 504, 408] The process can be further configured using the following parameters: RETRY_ENABLED (True/False - default is True) RETRY_TIMES (# of times to retry on any errors - default is 2) RETRY_HTTP_CODES (a list of HTTP error codes which should be retried - default is [500, 502, 503, 504, 408]) How to do it The 06/01_scrapy_retry.py script demonstrates how to configure Scrapy for retries. The script file contains the following configuration for Scrapy: process = CrawlerProcess({ 'LOG_LEVEL': 'DEBUG', 'DOWNLOADER_MIDDLEWARES': { "scrapy.downloadermiddlewares.retry.RetryMiddleware": 500 }, 'RETRY_ENABLED': True, 'RETRY_TIMES': 3 }) process.crawl(Spider) process.start() How it works Scrapy will pick up the configuration for retries as specified when the spider is run. When encountering errors, Scrapy will retry up to three times before giving up. Supporting page redirects Page redirects in Scrapy are handled using redirect middleware, which is enabled by default. The process can be further configured using the following parameters: REDIRECT_ENABLED: (True/False - default is True) REDIRECT_MAX_TIMES: (The maximum number of redirections to follow for any single request - default is 20) How to do it The script in 06/02_scrapy_redirects.py demonstrates how to configure Scrapy to handle redirects. This configures a maximum of two redirects for any page. Running the script reads the NASA sitemap and crawls that content. This contains a large number of redirects, many of which are redirects from HTTP to HTTPS versions of URLs. There will be a lot of output, but here are a few lines demonstrating the output: Parsing: <200 https://www.nasa.gov/content/earth-expeditions-above/> ['http://www.nasa.gov/content/earth-expeditions-above', 'https://www.nasa.gov/content/earth-expeditions-above'] This particular URL was processed after one redirection, from an HTTP to an HTTPS version of the URL. The list defines all of the URLs that were involved in the redirection. You will also be able to see where redirection exceeded the specified level (2) in the output pages. 
The following is one example:

2017-10-22 17:55:00 [scrapy.downloadermiddlewares.redirect] DEBUG: Discarding <GET http://www.nasa.gov/topics/journeytomars/news/index.html>: max redirections reached

How it works

The spider is defined as follows:

class Spider(scrapy.spiders.SitemapSpider):
    name = 'spider'
    sitemap_urls = ['https://www.nasa.gov/sitemap.xml']

    def parse(self, response):
        print("Parsing: ", response)
        print(response.request.meta.get('redirect_urls'))

This is identical to our previous NASA sitemap-based crawler, with the addition of one line printing the redirect_urls. In any call to parse, this metadata will contain all the redirects that occurred to get to this page. The crawling process is configured with the following code:

process = CrawlerProcess({
    'LOG_LEVEL': 'DEBUG',
    'DOWNLOADER_MIDDLEWARES': {
        "scrapy.downloadermiddlewares.redirect.RedirectMiddleware": 500
    },
    'REDIRECT_ENABLED': True,
    'REDIRECT_MAX_TIMES': 2
})

Redirects are enabled by default, but this sets the maximum number of redirects to 2 instead of the default of 20.

Waiting for content to be available in Selenium

A common problem with dynamic web pages is that even after the whole page has loaded, and hence the get() method in Selenium has returned, there may still be content that we need to access later, as there are outstanding Ajax requests from the page that are still pending completion. An example of this is needing to click a button, but the button not being enabled until all data has been loaded asynchronously to the page after loading.

Take the following page as an example: http://the-internet.herokuapp.com/dynamic_loading/2. This page finishes loading very quickly and presents us with a Start button. When pressing the button, we are presented with a progress bar for five seconds, and when this is completed, we are presented with Hello World! Now suppose we want to scrape this page to get the content that is exposed only after the button is pressed and after the wait. How do we do this?

How to do it

We can do this using Selenium. We will use two features of Selenium. The first is the ability to click on page elements. The second is the ability to wait until an element with a specific ID is available on the page. First, we get the button and click it. The button's HTML is the following:

<div id='start'>
    <button>Start</button>
</div>

When the button is pressed and the load completes, the following HTML is added to the document:

<div id='finish'>
    <h4>Hello World!</h4>
</div>

We will use the Selenium driver to find the Start button, click it, and then wait until a div with an ID of 'finish' is available. Then we get that element and return the text in the enclosed <h4> tag. You can try this by running 06/03_press_and_wait.py. Its output will be the following:

clicked
Hello World!

Now let's see how it worked.

How it works

Let us break down the explanation. We start by importing the required items from Selenium:

from selenium import webdriver
from selenium.webdriver.support import ui

Now we load the driver and the page:

driver = webdriver.PhantomJS()
driver.get("http://the-internet.herokuapp.com/dynamic_loading/2")

With the page loaded, we can retrieve the button:

button = driver.find_element_by_xpath("//*/div[@id='start']/button")

And then we can click the button:

button.click()
print("clicked")

Next we create a WebDriverWait object:

wait = ui.WebDriverWait(driver, 10)

With this object, we can request that Selenium wait for certain UI events. This also sets a maximum wait of 10 seconds.
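As an aside, Selenium also ships ready-made expected conditions that can replace the hand-written lambda used next; a rough sketch of the same wait written that way, reusing the driver created above and assuming a reasonably recent Selenium release, would be:

from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

# Wait up to 10 seconds for the div with id="finish" to appear in the DOM,
# then read the text of its enclosed <h4> element.
wait = WebDriverWait(driver, 10)
finish = wait.until(EC.presence_of_element_located((By.ID, "finish")))
print(finish.find_element(By.TAG_NAME, "h4").text)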
Now using this, we can wait until we meet a criterion: that an element is identifiable using the following XPath:

wait.until(lambda driver: driver.find_element_by_xpath("//*/div[@id='finish']"))

When this completes, we can retrieve the h4 element and get its enclosed text:

finish_element = driver.find_element_by_xpath("//*/div[@id='finish']/h4")
print(finish_element.text)

Limiting crawling to a single domain

We can inform Scrapy to limit the crawl to only pages within a specified set of domains. This is an important task, as links can point to anywhere on the web, and we often want to control where crawls end up going. Scrapy makes this very easy to do. All that needs to be done is to set the allowed_domains field of your scraper class.

How to do it

The code for this example is 06/04_allowed_domains.py. You can run the script with your Python interpreter. It will execute and generate a ton of output, but if you keep an eye on it, you will see that it only processes pages on nasa.gov.

How it works

The code is the same as the previous NASA site crawlers except that we include allowed_domains=['nasa.gov']:

class Spider(scrapy.spiders.SitemapSpider):
    name = 'spider'
    sitemap_urls = ['https://www.nasa.gov/sitemap.xml']
    allowed_domains = ['nasa.gov']

    def parse(self, response):
        print("Parsing: ", response)

The NASA site is fairly consistent about staying within its root domain, but there are occasional links to other sites, such as content on boeing.com. This code will prevent moving to those external sites.

Processing infinitely scrolling pages

Many websites have replaced "previous/next" pagination buttons with an infinite scrolling mechanism. These websites use this technique to load more data when the user has reached the bottom of the page. Because of this, strategies for crawling by following the "next page" link fall apart. While this would seem to be a case for using browser automation to simulate the scrolling, it's actually quite easy to figure out the web page's Ajax requests and use those for crawling instead of the actual page. Let's look at spidyquotes.herokuapp.com/scroll as an example.

Getting ready

Open http://spidyquotes.herokuapp.com/scroll in your browser. This page will load additional content when you scroll to the bottom of the page:

Screenshot of the quotes to scrape

Once the page is open, go into your developer tools and select the network panel. Then, scroll to the bottom of the page. You will see new content in the network panel. When we click on one of the links, we can see the following JSON:

{
    "has_next": true,
    "page": 2,
    "quotes": [{
        "author": {
            "goodreads_link": "/author/show/82952.Marilyn_Monroe",
            "name": "Marilyn Monroe",
            "slug": "Marilyn-Monroe"
        },
        "tags": ["friends", "heartbreak", "inspirational", "life", "love", "sisters"],
        "text": "\u201cThis life is what you make it...."
    }, {
        "author": {
            "goodreads_link": "/author/show/1077326.J_K_Rowling",
            "name": "J.K. Rowling",
            "slug": "J-K-Rowling"
        },
        "tags": ["courage", "friends"],
        "text": "\u201cIt takes a great deal of bravery to stand up to our enemies, but just as much to stand up to our friends.\u201d"
    },

This is great, because all we need to do is continually generate requests to /api/quotes?page=x, increasing x until the reply document no longer reports a next page via the has_next flag. If there are no more pages, then this flag will not be set in the document.

How to do it

The 06/05_scrapy_continuous.py file contains a Scrapy agent, which crawls this set of pages.
Run it with your Python interpreter and you will see output similar to the following (multiple excerpts from the output):

<200 http://spidyquotes.herokuapp.com/api/quotes?page=2>
2017-10-29 16:17:37 [scrapy.core.scraper] DEBUG: Scraped from <200 http://spidyquotes.herokuapp.com/api/quotes?page=2>
{'text': "“This life is what you make it. No matter what, you're going to mess up sometimes, it's a universal truth. But the good part is you get to decide how you're going to mess it up. Girls will be your friends - they'll act like it anyway. But just remember, some come, some go. The ones that stay with you through everything - they're your true best friends. Don't let go of them. Also remember, sisters make the best friends in the world. As for lovers, well, they'll come and go too. And baby, I hate to say it, most of them - actually pretty much all of them are going to break your heart, but you can't give up because if you give up, you'll never find your soulmate. You'll never find that half who makes you whole and that goes for everything. Just because you fail once, doesn't mean you're gonna fail at everything. Keep trying, hold on, and always, always, always believe in yourself, because if you don't, then who will, sweetie? So keep your head high, keep your chin up, and most importantly, keep smiling, because life's a beautiful thing and there's so much to smile about.”", 'author': 'Marilyn Monroe', 'tags': ['friends', 'heartbreak', 'inspirational', 'life', 'love', 'Sisters']}
2017-10-29 16:17:37 [scrapy.core.scraper] DEBUG: Scraped from <200 http://spidyquotes.herokuapp.com/api/quotes?page=2>
{'text': '“It takes a great deal of bravery to stand up to our enemies, but just as much to stand up to our friends.”', 'author': 'J.K. Rowling', 'tags': ['courage', 'friends']}
2017-10-29 16:17:37 [scrapy.core.scraper] DEBUG: Scraped from <200 http://spidyquotes.herokuapp.com/api/quotes?page=2>
{'text': "“If you can't explain it to a six year old, you don't understand it yourself.”", 'author': 'Albert Einstein', 'tags': ['simplicity', 'Understand']}

When this gets to page 10 it will stop, as it will see that there is no next-page flag set in the content.

How it works

Let's walk through the spider to see how this works. The spider starts with the following definition of the start URL:

class Spider(scrapy.Spider):
    name = 'spidyquotes'
    quotes_base_url = 'http://spidyquotes.herokuapp.com/api/quotes'
    start_urls = [quotes_base_url]
    download_delay = 1.5

The parse method then prints the response and also parses the JSON into the data variable:

def parse(self, response):
    print(response)
    data = json.loads(response.body)

Then it loops through all the items in the quotes element of the JSON object. For each item, it yields a new Scrapy item back to the Scrapy engine:

for item in data.get('quotes', []):
    yield {
        'text': item.get('text'),
        'author': item.get('author', {}).get('name'),
        'tags': item.get('tags'),
    }

It then checks whether the data JSON variable has a 'has_next' property, and if so it gets the next page number and yields a new request back to Scrapy to parse that page:

if data['has_next']:
    next_page = data['page'] + 1
    yield scrapy.Request(self.quotes_base_url + "?page=%s" % next_page)

There's more...

It is also possible to process infinitely scrolling pages using Selenium.
The following code is in 06/06_scrape_continuous_twitter.py:

from selenium import webdriver
import time

driver = webdriver.PhantomJS()

print("Starting")
driver.get("https://twitter.com")

scroll_pause_time = 1.5

# Get scroll height
last_height = driver.execute_script("return document.body.scrollHeight")
while True:
    print(last_height)
    # Scroll down to bottom
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")

    # Wait to load page
    time.sleep(scroll_pause_time)

    # Calculate new scroll height and compare with last scroll height
    new_height = driver.execute_script("return document.body.scrollHeight")
    print(new_height, last_height)

    if new_height == last_height:
        break
    last_height = new_height

The output would be similar to the following:

Starting
4882
8139 4882
8139
11630 8139
11630
15055 11630
15055
15055 15055

Process finished with exit code 0

This code starts by loading the page from Twitter. The call to .get() will return when the page is fully loaded. The scrollHeight is then retrieved, and the program scrolls to that height and waits a moment for the new content to load. The scrollHeight of the browser is retrieved again, and if it differs from last_height, the loop continues processing. If it is the same as last_height, no new content has loaded, and you can then continue on and retrieve the HTML for the completed page.

We have discussed the common challenges faced when performing web scraping with Python and their workarounds. If you liked this post, be sure to check out Web Scraping with Python, which consists of useful recipes to work with Python and perform efficient web scraping.
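There is also a lighter-weight way to handle the spidyquotes example from the previous recipe: because the infinite scroll is backed by a plain JSON API, the pages can be fetched directly with requests, following the page and has_next fields exactly as described above. A rough sketch, assuming the demo site is still online:

import requests

base_url = "http://spidyquotes.herokuapp.com/api/quotes"
page = 1
while True:
    # Request one page of quotes from the JSON API.
    data = requests.get(base_url, params={"page": page}).json()
    for item in data.get("quotes", []):
        print(item["author"]["name"], "-", item["text"][:40])
    # Stop when the API reports that there is no next page.
    if not data.get("has_next"):
        break
    page = data["page"] + 1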
Learning Dependency Injection (DI)

Packt
08 Mar 2018
15 min read
In this article by Sherwin John Calleja Tragura, author of the book Spring 5.0 Cookbook, we will learn about the implementation of the Spring container using XML and JavaConfig, and also about managing beans in an XML-based container. In this article, you will learn how to implement the Spring container using XML, implement the Spring container using JavaConfig, and manage the beans in an XML-based container.

Implementing the Spring container using XML

Let us begin with the creation of the Spring Web Project using the Maven plugin of our STS Eclipse 8.3. This web project will implement our first Spring 5.0 container using the XML-based technique. This is the most conventional but robust way of creating the Spring container. The container is where the objects are created, managed, wired together with their dependencies, and monitored from their initialization up to their destruction. This recipe will mainly highlight how to create an XML-based Spring container.

Getting ready

Create a Maven project ready for development using STS Eclipse 8.3. Be sure to have installed the correct JRE. Let us name the project ch02-xml.

How to do it…

After creating the project, certain Maven errors will be encountered. Fix the Maven issues of our ch02-xml project in order to use the XML-based Spring 5.0 container by performing the following steps:

Open pom.xml of the project and add the following properties, which contain the Spring build version and Servlet container to utilize:

<properties>
    <spring.version>5.0.0.BUILD-SNAPSHOT</spring.version>
    <servlet.api.version>3.1.0</servlet.api.version>
</properties>

Add the following Spring 5 dependencies inside pom.xml. These dependencies are essential in providing us with the interfaces and classes to build our Spring container:

<dependencies>
    <dependency>
        <groupId>org.springframework</groupId>
        <artifactId>spring-context</artifactId>
        <version>${spring.version}</version>
    </dependency>
    <dependency>
        <groupId>org.springframework</groupId>
        <artifactId>spring-core</artifactId>
        <version>${spring.version}</version>
    </dependency>
    <dependency>
        <groupId>org.springframework</groupId>
        <artifactId>spring-beans</artifactId>
        <version>${spring.version}</version>
    </dependency>
</dependencies>

It is required to add the following repositories, from which the Spring 5.0 dependencies in Step 2 will be downloaded:

<repositories>
    <repository>
        <id>spring-snapshots</id>
        <name>Spring Snapshots</name>
        <url>https://repo.spring.io/libs-snapshot</url>
        <snapshots>
            <enabled>true</enabled>
        </snapshots>
    </repository>
</repositories>

Then add the Maven plugin for deployment, but be sure to recognize web.xml as the deployment descriptor. This can be done by enabling <failOnMissingWebXml> or just deleting the <configuration> tag as follows:

<plugin>
    <artifactId>maven-war-plugin</artifactId>
    <version>2.3</version>
</plugin>

Follow the Tomcat Maven plugin for deployment, as explained in Chapter 1. After the Maven configuration details, check if there is a WEB-INF folder inside src/main/webapp. If there is none, create one. This is mandatory for this project since we will be using a deployment descriptor (web.xml). Inside the WEB-INF folder, create a deployment descriptor or drop a web.xml template into the src/main/webapp/WEB-INF directory. Then, create an XML-based Spring container named ch02-beans.xml inside the ch02-xml/src/main/java/ directory.
The configuration file must contain the following namespaces and tags:

<?xml version="1.0" encoding="UTF-8"?>
<beans xmlns="http://www.springframework.org/schema/beans"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xmlns:context="http://www.springframework.org/schema/context"
    xsi:schemaLocation="http://www.springframework.org/schema/beans
        http://www.springframework.org/schema/beans/spring-beans.xsd
        http://www.springframework.org/schema/context
        http://www.springframework.org/schema/context/spring-context.xsd">
</beans>

You can generate this file using the STS Eclipse wizard (Ctrl-N), under the Spring module's Spring Bean Configuration File option. Save all the files, then clean and build the Maven project. Do not deploy yet, because this is just a standalone project at the moment.

How it works…

This project just imported three major Spring 5.0 libraries, namely Spring-Core, Spring-Beans, and Spring-Context, because the major classes and interfaces for creating the container are found in these libraries. This shows that Spring, unlike other frameworks, does not need an entire load of libraries just to set up the initial platform. Spring can be perceived as a huge enterprise framework nowadays, but internally it is still lightweight.

The basic container that manages objects in Spring is provided by the org.springframework.beans.factory.BeanFactory interface and can only be found in the Spring-Beans module. Once additional features are needed, such as message resource handling, AOP capabilities, application-specific contexts, and listener implementation, the sub-interface of BeanFactory, namely the org.springframework.context.ApplicationContext interface, is used instead. This ApplicationContext, found in the Spring-Context module, is the one that provides an enterprise-specific container for all its applications, because it encompasses a larger scope of Spring components than the BeanFactory interface.

The container created, ch02-beans.xml, an ApplicationContext, is an XML-based configuration that contains XSD schemas from the three main libraries imported. These schemas have tag libraries and bean properties which are essential in managing the whole framework. But beware of runtime errors once libraries are removed from the dependencies, because using these tags is equivalent to using the libraries per se. The final Spring Maven project directory structure must look like this:

Implementing the Spring container using JavaConfig

Another option for implementing the Spring 5.0 container is through the use of Spring JavaConfig. This is a technique that uses pure Java classes to configure the framework's container. This technique eliminates the use of bulky and tedious XML metadata and also provides a type-safe and refactoring-free approach to configuring entities or collections of objects in the container. This recipe will showcase how to create the container using JavaConfig in a web.xml-less approach.

Getting ready

Create another Maven project and name the project ch02-jc. This STS Eclipse project will be using a Java class approach, including for its deployment descriptor.

How to do it…

To get rid of the usual Maven bugs, immediately open the pom.xml of ch02-jc and add <properties>, <dependencies>, and <repositories> equivalent to what was added in the Implementing the Spring container using XML recipe. Next, get rid of web.xml. Since the Servlet 3.0 specification was implemented, servlet containers can now support projects without web.xml. This is done by implementing the handler interface called org.springframework.web.WebApplicationInitializer to programmatically configure the ServletContext.
Create a SpringWebinitializer class and override its onStartup() method, without any implementation yet:

public class SpringWebinitializer implements WebApplicationInitializer {

    @Override
    public void onStartup(ServletContext container) throws ServletException {
    }
}

The lines in Step 2 will generate some runtime errors until you add the following Maven dependency:

<dependency>
    <groupId>org.springframework</groupId>
    <artifactId>spring-web</artifactId>
    <version>${spring.version}</version>
</dependency>

In pom.xml, disable <failOnMissingWebXml>. After the Maven details, create a class named BeanConfig, the ApplicationContext definition, bearing the annotation @Configuration at the top of it. The class must be inside the org.packt.starter.ioc.context package and must be an empty class at the moment:

@Configuration
public class BeanConfig {

}

Save all the files and clean and build the Maven project.

How it works…

The Maven project ch02-jc makes use of both JavaConfig and ServletContainerInitializer, meaning there will be no XML configuration from the servlet to the Spring 5.0 containers. The BeanConfig class is the ApplicationContext of the project; it carries the annotation @Configuration, indicating that the class is used by JavaConfig as a source of bean definitions. This is handier than creating an XML-based configuration with lots of metadata.

On the other hand, ch02-jc implemented org.springframework.web.WebApplicationInitializer, which is handled by org.springframework.web.SpringServletContainerInitializer, the framework's implementation of the servlet's ServletContainerInitializer. SpringServletContainerInitializer is notified by the servlet container at startup and in turn invokes onStartup(ServletContext) on each WebApplicationInitializer, which takes care of the programmatic registration of the filters, servlets, and listeners provided by the ServletContext. Eventually, the servlet container will acknowledge the status reported by SpringServletContainerInitializer, thus eliminating the use of web.xml.

On Maven's side, the plugin for deployment must be notified that the project will not use web.xml. This is done by setting <failOnMissingWebXml> to false inside its <configuration> tag. The final Spring Web Project directory structure must look like the following structure:

Managing the beans in an XML-based container

Frameworks become popular because of the principles behind the architecture they are built from. Each framework is built from different design patterns that manage the creation and behavior of the objects they manage. This recipe will detail how Spring 5.0 manages the objects of its applications and how it shares a set of methods and functions across the platform.

Getting ready

The two Maven projects previously created will be utilized to illustrate how Spring 5.0 loads objects into heap memory. We will also be utilizing the ApplicationContext rather than the BeanFactory container, in preparation for the next recipes involving more Spring components.

How to do it…

With our ch02-xml, let us demonstrate how Spring loads objects using the XML-based ApplicationContext container. Create a package layer, org.packt.starter.ioc.model, for our model classes. Our model classes will be typical Plain Old Java Objects (POJOs), which the Spring 5.0 architecture is known for.
Inside the newly created package, create the classes Employee and Department, which contain the following blueprints:

public class Employee {

    private String firstName;
    private String lastName;
    private Date birthdate;
    private Integer age;
    private Double salary;
    private String position;
    private Department dept;

    public Employee(){
        System.out.println(" an employee is created.");
    }

    public Employee(String firstName, String lastName, Date birthdate, Integer age, Double salary, String position, Department dept) {
        this.firstName = firstName;
        this.lastName = lastName;
        this.birthdate = birthdate;
        this.age = age;
        this.salary = salary;
        this.position = position;
        this.dept = dept;
        System.out.println(" an employee is created.");
    }

    // getters and setters
}

public class Department {

    private Integer deptNo;
    private String deptName;

    public Department() {
        System.out.println("a department is created.");
    }

    // getters and setters
}

Afterwards, open the ApplicationContext ch02-beans.xml. Register our first set of Employee and Department objects using the <bean> tag as follows:

<bean id="empRec1" class="org.packt.starter.ioc.model.Employee" />
<bean id="dept1" class="org.packt.starter.ioc.model.Department" />

The beans in Step 3 contain private instance variables that have zero and null default values. To update them, our classes have mutators or setter methods that can be used to avoid the NullPointerException that always happens when we immediately use empty objects. In Spring, calling these setters is tantamount to injecting data into the <bean>, similar to how the following objects are created:

<bean id="empRec2" class="org.packt.starter.ioc.model.Employee">
    <property name="firstName"><value>Juan</value></property>
    <property name="lastName"><value>Luna</value></property>
    <property name="age"><value>70</value></property>
    <property name="birthdate"><value>October 28, 1945</value></property>
    <property name="position"><value>historian</value></property>
    <property name="salary"><value>150000</value></property>
    <property name="dept"><ref bean="dept2"/></property>
</bean>

<bean id="dept2" class="org.packt.starter.ioc.model.Department">
    <property name="deptNo"><value>13456</value></property>
    <property name="deptName"><value>History Department</value></property>
</bean>

A <property> tag is equivalent to a setter definition accepting an actual value or an object reference. The name attribute defines the name of the setter minus the set prefix, converted to camel-case notation. The value attribute or the <value> tag both pertain to supported Spring-type values (for example, int, double, float, Boolean, String). The ref attribute or <ref> provides a reference to another loaded <bean> in the container. Another way of writing the bean object empRec2 is through the use of ref and value attributes, such as the following:

<bean id="empRec3" class="org.packt.starter.ioc.model.Employee">
    <property name="firstName" value="Jose"/>
    <property name="lastName" value="Rizal"/>
    <property name="age" value="101"/>
    <property name="birthdate" value="June 19, 1950"/>
    <property name="position" value="scriber"/>
    <property name="salary" value="90000"/>
    <property name="dept" ref="dept3"/>
</bean>

<bean id="dept3" class="org.packt.starter.ioc.model.Department">
    <property name="deptNo" value="56748"/>
    <property name="deptName" value="Communication Department" />
</bean>

Another way of updating the private instance variables of the model objects is to make use of the constructors.
Actual Spring data and object references can be inserted through the constructor arguments in the metadata:

<bean id="empRec5" class="org.packt.starter.ioc.model.Employee">
    <constructor-arg><value>Poly</value></constructor-arg>
    <constructor-arg><value>Mabini</value></constructor-arg>
    <constructor-arg><value>August 10, 1948</value></constructor-arg>
    <constructor-arg><value>67</value></constructor-arg>
    <constructor-arg><value>45000</value></constructor-arg>
    <constructor-arg><value>Linguist</value></constructor-arg>
    <constructor-arg><ref bean="dept3"></ref></constructor-arg>
</bean>

After all the modifications, save ch02-beans.xml. Create a TestBeans class inside the src/test/java directory. This class will load the XML configuration resource into the ApplicationContext container through org.springframework.context.support.ClassPathXmlApplicationContext and fetch all the objects created through its getBean() method.

public class TestBeans {

    public static void main(String args[]){
        ApplicationContext context = new ClassPathXmlApplicationContext("ch02-beans.xml");

        System.out.println("application context loaded.");

        System.out.println("****The empRec1 bean****");
        Employee empRec1 = (Employee) context.getBean("empRec1");

        System.out.println("****The empRec2*****");
        Employee empRec2 = (Employee) context.getBean("empRec2");
        Department dept2 = empRec2.getDept();
        System.out.println("First Name: " + empRec2.getFirstName());
        System.out.println("Last Name: " + empRec2.getLastName());
        System.out.println("Birthdate: " + empRec2.getBirthdate());
        System.out.println("Salary: " + empRec2.getSalary());
        System.out.println("Dept. Name: " + dept2.getDeptName());

        System.out.println("****The empRec5 bean****");
        Employee empRec5 = context.getBean("empRec5", Employee.class);
        Department dept3 = empRec5.getDept();
        System.out.println("First Name: " + empRec5.getFirstName());
        System.out.println("Last Name: " + empRec5.getLastName());
        System.out.println("Dept. Name: " + dept3.getDeptName());
    }
}

The expected output after running the main() thread will be:

an employee is created.
an employee is created.
a department is created.
an employee is created.
a department is created.
an employee is created.
a department is created.
application context loaded.
*********The empRec1 bean ***************
*********The empRec2 bean ***************
First Name: Juan
Last Name: Luna
Birthdate: Sun Oct 28 00:00:00 CST 1945
Salary: 150000.0
Dept. Name: History Department
*********The empRec5 bean ***************
First Name: Poly
Last Name: Mabini
Dept. Name: Communication Department

How it works…

The principle behind creating <bean> objects in the container is called the Inversion of Control design pattern. In order to use the objects, their dependencies, and also their behavior, these must be placed within the framework per se. After registering them in the container, Spring will just take care of their instantiation and their availability to other objects. Developers can just "fetch" them if they want to include them in their software modules, as shown in the following diagram:

The IoC design pattern can be seen as synonymous with the Hollywood Principle ("Don't call us, we'll call you!"), which is a popular line in most object-oriented programming languages. The framework does not care whether the developer needs the objects or not, because the lifespan of the objects lies with the framework's rules.
In the case of setting new values or updating the values of an object's private variables, IoC has an implementation that can be used for "injecting" new actual values or object references into the bean, and it is popularly known as the Dependency Injection (DI) design pattern. This principle exposes all the properties of the bean to the public through its setter methods or its constructors. Injecting Spring values and object references into the setter methods using the <property> tag, without knowing their implementation, is called the Method Injection type of DI. On the other hand, if we create the bean with initialized values injected into its constructor through <constructor-arg>, it is known as Constructor Injection.

To create the ApplicationContext container, we need to instantiate ClassPathXmlApplicationContext or FileSystemXmlApplicationContext, depending on the location of the XML definition file. Since the file is found in ch02-xml/src/main/java/, the ClassPathXmlApplicationContext implementation is the best option. This proves that the ApplicationContext is an object too, bearing all that XML metadata. It has several overloaded getBean() methods used to fetch all the objects loaded with it.

Summary

In this article we went over how to create an XML-based Spring container, how to create the container using JavaConfig in a web.xml-less approach, and how Spring 5.0 manages the objects of its applications and shares a set of methods and functions across the platform.
Spam Filtering - Natural Language Processing Approach

Packt
08 Mar 2018
16 min read
In this article by Jalaj Thanaki, the author of the book Python Natural Language Processing, we discuss how to develop a natural language processing (NLP) application. In this article, we will be developing a spam filter. In order to develop the spam filter we will use a supervised machine learning (ML) algorithm named logistic regression. You could also use a decision tree, Naive Bayes, or a support vector machine (SVM). To make this happen, the following steps will be covered: understanding the logistic regression ML algorithm, data collection and exploration, and splitting the dataset into a training dataset and a testing dataset.

Understanding the logistic regression ML algorithm

Let's understand the logistic regression algorithm first. For this classification algorithm, I will give you an intuition of how logistic regression works, and we will see some basic mathematics related to it. Then we will see the spam filtering application.

First, we consider binary classes like spam or not-spam, good or bad, win or lose, 0 or 1, and so on, for understanding the algorithm and its application. Suppose I want to classify emails into the spam and non-spam (ham) categories, so spam and non-spam are the discrete output labels, or target concepts, here. Our goal is to predict whether a new email is spam or not-spam. Not-spam is also known as ham. In order to build this NLP application we are going to use logistic regression.

Let's step back a while and understand the technicalities of the algorithm first. Here I'm stating the facts and mathematics related to this algorithm in a very simple manner, so everyone can understand the logic. The general approach for understanding this algorithm is as follows; if you know some ML then you can connect the dots, and if you are new to ML then don't worry, because we are going to understand every part. We define our hypothesis function, which helps us generate our target output or target concept. We define the cost function or error function, and we choose the error function in such a way that we can derive its partial derivative easily, so we can calculate gradient descent easily. Over time, we try to minimize the error so we can generate more accurate labels and classify the data accurately.

In statistics, logistic regression is also called logit regression or the logit model. This algorithm is mostly used as a binary classifier, which means there should be two different classes into which you want to classify the data. The binary logistic model is used to estimate the probability of a binary response, and it generates the response based on one or more predictors, independent variables, or features. By the way, the basic mathematical concepts behind this ML algorithm are used in deep learning (DL) as well.

First I want to explain why this algorithm is called logistic regression. The reason is that the algorithm uses the logistic function, or sigmoid function, and that is why it is called logistic regression. Logistic function and sigmoid function are synonyms of each other. We use the sigmoid function as the hypothesis function, and this function belongs to the hypothesis class. Now, what do we mean by the hypothesis function? Well, as we have seen earlier, the machine has to learn the mapping between the data attributes and the given labels in such a way that it can predict the label for new data. This can be achieved by the machine if it learns this mapping using a mathematical function.
So the mathematical function is called the hypothesis function, which the machine will use to classify the data and predict the labels or target concept. Here, as I said, we want to build a binary classifier, so our label is either spam or ham. Mathematically, I can assign 0 for ham or not-spam and 1 for spam, or vice versa, as per your choice. These mathematically assigned labels are our dependent variables. Now we need our output labels to be either zero or one. Mathematically, we can say that the label is y and y ∈ {0, 1}. So we need to choose the kind of hypothesis function that converts our output value to either zero or one, and the logistic function, or sigmoid function, does exactly that. This is the main reason why logistic regression uses the sigmoid function as the hypothesis function.

Logistic or sigmoid function

Let me provide you the mathematical equation for the logistic or sigmoid function. Refer to Figure 1:

Figure 1: Logistic or sigmoid function

You can see the plot showing g(z). Here, g(z) = Φ(z). Refer to Figure 2:

Figure 2: Graph of the sigmoid or logistic function

From the preceding graph you can see the following facts. If the value of z is greater than or equal to zero, then the logistic function gives an output value of 0.5 or more, which we treat as the class label one. If the value of z is less than zero, then the logistic function generates an output below 0.5, which we treat as the class label zero. You can see the following mathematical condition for the logistic function. Refer to Figure 3:

Figure 3: Mathematical property of the logistic function

Because of the preceding mathematical property, we can use this function to perform binary classification. Now it's time to show how this sigmoid function is represented as the hypothesis function. Refer to Figure 4:

Figure 4: Hypothesis function for logistic regression

If we take the preceding equation and substitute the value of z with θᵀx, then the equation given in Figure 1 is converted as follows. Refer to Figure 5:

Figure 5: Actual hypothesis function after mathematical manipulation

Here hθ(x) is the hypothesis function, θᵀ is the transpose of the matrix of parameters for the independent variables, and x stands for all the independent variables, that is, all possible features. In order to generate the hypothesis equation we replace the z value of the logistic function with θᵀx. By using the hypothesis equation, the machine actually tries to learn the mapping between the input variables or input features and the output labels.

Let's talk a bit about the interpretation of this hypothesis function. For logistic regression, can you think of the best way to predict the class label? We can predict the target class label by using the concept of probability. We need to generate the probability for both classes, and whichever class has the higher probability is assigned to that particular instance of features. In binary classification, the value of y, the target class, is either zero or one. If you are familiar with probability, you can represent the probability equation as given in Figure 6:

Figure 6: Interpretation of the hypothesis function using a probabilistic representation

For those who are not familiar with probability, P(y=1|x;θ) can be read like this: the probability of y = 1, given x, parameterized by θ. In simple language, you can say that this hypothesis function generates the probability value for target output 1, where we give the feature matrix x and some parameters θ. This is an intuitive concept, so for a while, you can keep all of this in mind.
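The equations referenced in Figures 1 through 6 are not reproduced in this excerpt; written out in standard notation (a reconstruction, not copied from the book's figures), the quantities being discussed are:

\[ g(z) = \frac{1}{1 + e^{-z}}, \qquad h_\theta(x) = g(\theta^T x) = \frac{1}{1 + e^{-\theta^T x}} \]

\[ P(y = 1 \mid x; \theta) = h_\theta(x), \qquad P(y = 0 \mid x; \theta) = 1 - h_\theta(x) \]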
Later on I will give you the reason why we need to generate probabilities, as well as how we can generate the probability values for each class. Here we complete the first step of the general approach to understanding logistic regression.

Cost or error function for logistic regression

First, let's understand what a cost function or error function is. Cost function, loss function, and error function are all names for the same thing. In ML it is a very important concept, so here we cover the definition of the cost function and the purpose of defining it. The cost function is the function we use to check how accurately our ML classifier performs. In our training dataset we have data and we have labels. When we use the hypothesis function and generate the output, we need to check how near we are to the actual label. If we predict the actual output label, then the difference between our hypothesis function output and the actual label is zero or minimal; if our hypothesis function output and the actual label are not the same, then we have a big difference between them. Suppose the actual label of an email is spam, which is 1, and our hypothesis function also generates the result 1: then the difference between the actual target value and the predicted output value is zero, and therefore the error in the prediction is also zero. If our predicted output is 1 and the actual output is zero, then we have the maximum error between our actual target concept and the prediction. So it is important for us to have minimum error in our prediction. This is the very basic concept of the error function. We will get into the mathematics in a moment. There are several types of error functions available, such as R² error, sum of squared error, and so on. The error function changes according to the ML algorithm and the hypothesis function.

Now, I know you want to know what the error function for logistic regression will be. I have put θ in our hypothesis function, so you also want to know what θ is and, if some value of θ needs to be chosen, how to approach that. Here I will give all the answers.

Let me give you some background on what we used to do in linear regression, as it will help you understand logistic regression. In linear regression we generally used the sum of squared errors (residual errors) as the cost function. So, just to give you some background on the sum of squared errors: in linear regression we are trying to generate the line of best fit for our dataset. As in the example I stated earlier, given height I want to predict the weight; in this case we first draw a line and measure the distance from each of the data points to the line. We square these distances, sum them, and try to minimize this error function. Refer to Figure 7:

Figure 7: Sum of squared error representation for reference

You can see the distance of each data point from the line, denoted using red lines; we take these distances, square them, and sum them. This is the error function we use in linear regression. From this error function we generate the partial derivatives with respect to the slope of the line, m, and with respect to the intercept, b. Each time, we calculate the error and update the values of m and b so we can generate the line of best fit. The process of updating m and b is called gradient descent. By using gradient descent we update m and b in such a way that our error function reaches its minimum error value, and we can generate the line of best fit.
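Written out in standard notation (again a reconstruction, not the book's Figure 7), the sum of squared errors for a candidate line y = mx + b and the gradient descent updates are:

\[ E(m, b) = \sum_{i=1}^{n} \bigl( y_i - (m x_i + b) \bigr)^2 \]

\[ m := m - \alpha \frac{\partial E}{\partial m}, \qquad b := b - \alpha \frac{\partial E}{\partial b} \]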
Gradient descent gives us the direction in which we need to adjust the line so we can generate the line of best fit. You can find a detailed example in Chapter 9, Deep Learning for NLU and NLG Problems. So, by defining the error function and generating the partial derivatives, we can apply the gradient descent algorithm, which helps us minimize our error or cost function.

Now back to the main question: which error function can we use for logistic regression? What do you think - can we use the sum of squared error function for logistic regression as well? If you know functions and calculus very well, then your answer is probably no, and that is the correct answer. Let me explain for those who aren't familiar with functions and calculus. In linear regression our hypothesis function is linear, so it is very easy for us to calculate the sum of squared errors, but here we are using the sigmoid function, which is non-linear. Applying the same function we used in linear regression will not turn out well, because if you take the sigmoid function, put it into the sum of squared error function, and try to visualize all possible values, you get a non-convex curve. Refer to Figure 8:

Figure 8: Non-convex and convex functions (Image credit: http://www.yuthon.com/images/non-convex_and_convex_function.png)

In machine learning we mostly use functions that produce a convex curve, because then we can use the gradient descent algorithm to minimize the error function and reach the global minimum with certainty. As you saw in Figure 8, a non-convex curve has many local minima, so reaching the global minimum is very challenging and very time consuming, because you would need to apply second-order or nth-order optimization techniques. With a convex curve, you can reach the global minimum with certainty, and quickly as well. So, if we plug our sigmoid function into the sum of squared errors we get a non-convex function, and therefore we are not going to define the same error function we used in linear regression.

Instead, we use the statistical concept called likelihood. To derive the likelihood function we use the probability equation given in Figure 6, considering all the data points in the training set. This gives us the following equation, which is the likelihood function. Refer to Figure 9:

Figure 9: Likelihood function for logistic regression (Image credit: http://cs229.stanford.edu/notes/cs229-notes1.pdf)

Now, in order to simplify the derivative process, we take the natural logarithm of the likelihood function (a monotonically increasing transformation); this is called the log likelihood. This log likelihood is our cost function for logistic regression. See the equation given in Figure 10:

Figure 10: Cost function for logistic regression

Here, to gain some intuition about the given cost function, we will plot it and understand what benefit it provides us. On the x-axis we have our hypothesis function. Our hypothesis function's range is 0 to 1, so we have these two points on the x-axis. Start with the first case, where y = 1 (the standard form of the likelihood and of this cost function is reconstructed just below for reference).
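In standard notation (a reconstruction of Figures 9 and 10, not copied from them), the likelihood and the resulting cost function are:

\[ L(\theta) = \prod_{i=1}^{m} h_\theta\bigl(x^{(i)}\bigr)^{y^{(i)}} \bigl(1 - h_\theta(x^{(i)})\bigr)^{1 - y^{(i)}} \]

\[ J(\theta) = -\frac{1}{m} \sum_{i=1}^{m} \Bigl[ y^{(i)} \log h_\theta\bigl(x^{(i)}\bigr) + \bigl(1 - y^{(i)}\bigr) \log\bigl(1 - h_\theta(x^{(i)})\bigr) \Bigr] \]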
You can see the generated curve, which is on the top right-hand side of Figure 11:

Figure 11: Logistic regression cost function graphs

If you take any log function plot and flip that curve (because here we have a negative sign), you get the same curve that we plotted in Figure 11. You can see the log graph as well as the flipped graph in Figure 12:

Figure 12: Comparing the log(x) and -log(x) graphs for a better understanding of the cost function (Image credit: http://www.sosmath.com/algebra/logs/log4/log42/log422/gl30.gif)

Here we are interested in the values 0 and 1, so we consider only that part of the graph, which is what we have depicted in Figure 11. This cost function has some interesting and useful properties. If the predicted or candidate label is the same as the actual target label, then the cost is zero. You can put it like this: if y = 1 and the hypothesis function predicts hθ(x) = 1, then the cost is 0, but if hθ(x) tends towards 0, then the cost function blows up to ∞. Now, for y = 0, you can see the graph on the top left-hand side of Figure 11. This case has the same kind of advantages and properties that we saw earlier: the cost goes to ∞ when the actual value is 0 and the hypothesis function predicts 1, and if the hypothesis function predicts 0 and the actual target is also 0, then the cost is 0.

As I told you earlier, I would give you the reason why we choose this cost function: the reason is that this function makes our optimization easy, as we are maximizing the log likelihood, and the function has a convex curve, which helps us run gradient descent. In order to apply gradient descent we need to generate the partial derivative with respect to θ, which gives us the equation in Figure 13:

Figure 13: Partial derivative for performing gradient descent (Image credit: http://2.bp.blogspot.com)

This equation is used for updating the parameter value of θ, and α here defines the learning rate. This is the parameter you can use to control how fast or how slowly your algorithm should learn or train. If you set the learning rate too high, the algorithm cannot learn, and if you set it too low, it takes a lot of time to train. So you need to choose the learning rate wisely. Now let's start building the spam filtering application.

Data loading and exploration

To build the spam filtering application we need a dataset. Here we are using a small dataset that is simple and straightforward. This dataset has two attributes: the first attribute is the label and the second attribute is the text content of the email. Let's discuss the first attribute a little more. The presence of the label makes this a tagged dataset. The label indicates whether the email content belongs to the spam category or the ham category. Let's jump into the practical part. Here we are using numpy, pandas, and scikit-learn as dependency libraries. So let's explore our dataset first. We read the dataset using the pandas library. I have also checked how many data records we have in total, along with basic details of the dataset. Once we load the data, we will check its first ten records and then replace the spam and ham categories with numbers.
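As a rough illustration of the loading-and-exploration step just described, the following pandas sketch assumes a hypothetical spam_dataset.csv file with label and text columns; the actual file name and column names used in the book's code (shown in the figures) may differ:

import pandas as pd

# Hypothetical file and column names; the dataset used in the book may differ.
df = pd.read_csv("spam_dataset.csv", names=["label", "text"])

print(df.shape)       # total number of records and columns
print(df.head(10))    # first ten records

# Replace the string labels with numbers: ham -> 0, spam -> 1
df["label"] = df["label"].map({"ham": 0, "spam": 1})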
As we have seen, the machine can understand only numerical format, so here every ham label is converted into 0 and every spam label is converted into 1. Refer to Figure 14:

Figure 14: Code snippet for converting labels into numerical format

Split the dataset into a training dataset and a testing dataset

In this part we divide our dataset into two parts: one part is called the training set and the other part is called the testing set. Refer to Figure 15:

Figure 15: Code snippet for dividing the dataset into a training dataset and a testing dataset

We divide the dataset into two parts because we perform training using the training dataset; once our ML algorithm has been trained on that dataset and has generated an ML model, we feed the testing data into that model, and as a result the model generates predictions. Based on those results we evaluate our ML model.
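Putting the splitting, training, and evaluation steps together, here is a hedged scikit-learn sketch of the pipeline just described. The bag-of-words vectorizer, the 80/20 split, and the column names are illustrative assumptions and not necessarily what the code in Figures 14 and 15 uses:

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# df is the pandas DataFrame loaded earlier, with numeric labels (ham=0, spam=1)
X_train, X_test, y_train, y_test = train_test_split(
    df["text"], df["label"], test_size=0.2, random_state=42)

# Turn the raw email text into bag-of-words count features
vectorizer = CountVectorizer()
X_train_counts = vectorizer.fit_transform(X_train)
X_test_counts = vectorizer.transform(X_test)

# Train the logistic regression classifier and evaluate it on the held-out test set
classifier = LogisticRegression()
classifier.fit(X_train_counts, y_train)
predictions = classifier.predict(X_test_counts)
print("Test accuracy:", accuracy_score(y_test, predictions))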
Data Exploration using Spark SQL

Packt
08 Mar 2018
9 min read
In this article by Aurobindo Sarkar, the author of the book Learning Spark SQL, we will cover the following points to introduce you to using Spark SQL for exploratory data analysis:

What is Exploratory Data Analysis (EDA)?
Why is EDA important?
Using Spark SQL for basic data analysis
Visualizing data with Apache Zeppelin

Introducing Exploratory Data Analysis (EDA)

Exploratory Data Analysis (EDA), or Initial Data Analysis (IDA), is an approach to data analysis that attempts to maximize insight into data. This includes assessing the quality and structure of the data, calculating summary or descriptive statistics, and plotting appropriate graphs. It can uncover underlying structures and suggest how the data should be modeled. Furthermore, EDA helps us detect outliers, errors, and anomalies in our data, and deciding what to do about such data is often more important than other, more sophisticated analysis. EDA enables us to test our underlying assumptions, discover clusters and other patterns in our data, and identify possible relationships between various variables. A careful EDA process is vital to understanding the data and is sometimes sufficient to reveal such poor data quality that a more sophisticated model-based analysis is not justified.

Typically, the graphical techniques used in EDA are simple, consisting of plotting the raw data and simple statistics. The focus is on the structures and models revealed by the data, or that best fit the data. EDA techniques include scatter plots, box plots, histograms, probability plots, and so on. In most EDA techniques, we use all of the data, without making any underlying assumptions. The analyst builds intuition, or gets a “feel”, for the dataset as a result of such exploration. More specifically, the graphical techniques allow us to efficiently select and validate appropriate models, test our assumptions, identify relationships, select estimators, and detect outliers. EDA involves a lot of trial and error, and several iterations. The best way is to start simple and then build in complexity as you go along. There is a major trade-off in modeling between simple models and more accurate ones. Simple models may be much easier to interpret and understand. These models can get you to 90% accuracy very quickly, versus a more complex model that might take weeks or months to get you an additional 2% improvement. For example, you should plot simple histograms and scatter plots to quickly start developing an intuition for your data.

Using Spark SQL for basic data analysis

Interactively processing and visualizing large data is challenging, as the queries can take a long time to execute and the visual interface cannot accommodate as many pixels as data points. Spark supports in-memory computations and a high degree of parallelism to achieve interactivity with large distributed data. In addition, Spark is capable of handling petabytes of data and provides a set of versatile programming interfaces and libraries. These include SQL, Scala, Python, Java, and R APIs, and libraries for distributed statistics and machine learning. For data that fits into a single computer, there are many good tools available, such as R, MATLAB, and others. However, if the data does not fit into a single machine, or if it is very complicated to get the data to that machine, or if a single computer cannot easily process the data, then this section will offer some good tools and techniques for data exploration. Here, we will do some basic data exploration exercises to understand a sample dataset.
We will use a dataset that contains data related to direct marketing campaigns (phone calls) of a Portuguese banking institution. The marketing campaigns were based on phone calls to customers. We use the bank-additional-full.csv file that contains 41188 records and 20 input fields, ordered by date (from May 2008 to November 2010). As a first step, let's define a schema and read in the CSV file to create a DataFrame. You can use :paste to paste in the initial set of statements in the Spark shell, as shown in the following figure:

After the DataFrame is created, we first verify the number of records. We can also define a case class called Call for our input records, and then create a strongly-typed Dataset, as follows:

In the next section, we will begin our data exploration by identifying missing data in our dataset.

Identifying Missing Data

Missing data can occur in datasets due to reasons ranging from negligence to a refusal on the part of respondents to provide a specific data point. However, in all cases missing data is a common occurrence in real-world datasets. Missing data can create problems in data analysis and sometimes lead to wrong decisions or conclusions. Hence, it is very important to identify missing data and devise effective strategies for dealing with it. Here, we analyze the numbers of records with missing data fields in our sample dataset. In order to simulate missing data, we will edit our sample dataset by replacing fields containing “unknown” values with empty strings. First, we create a DataFrame / Dataset from our edited file, as shown in the following figure:

The following two statements give us a count of rows with certain fields having missing data. Later, we will look at effective ways of dealing with missing data and compute some basic statistics for the sample dataset to improve our understanding of the data.

Computing basic statistics

Computing basic statistics is essential for a good preliminary understanding of our data. First, for convenience, we create a case class and a Dataset containing a subset of fields from our original DataFrame. In the following example, we choose some of the numeric fields and the outcome field, that is, the “term deposit subscribed” field. Next, we use describe to quickly compute the count, mean, standard deviation, min, and max values for the numeric columns in our dataset. Further, we use the stat package to compute additional statistics such as covariance and correlation, creating crosstabs, examining items that occur most frequently in data columns, and computing quantiles. These computations are shown in the following figure:

Next, we use the typed aggregation functions to summarize our data to understand it better. In the following statement, we aggregate the results by whether a term deposit was subscribed, along with the total customers contacted, the average number of calls made per customer, the average duration of the calls, and the average number of previous calls made to such customers. The results are rounded to two decimal points. Similarly, executing the following statement gives similar results by customers' age.

After getting a better understanding of our data by computing basic statistics, we shift our focus to identifying outliers in our data.

Identifying data outliers

An outlier or an anomaly is an observation of the data that deviates significantly from other observations in the dataset. Erroneous outliers are observations that are distorted due to possible errors in the data-collection process.
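The chapter's code appears as Scala figures that are not reproduced here. As a rough, hedged equivalent in PySpark (keeping to Python, as elsewhere in this collection), the read-and-summarize steps might look like the following; the semicolon delimiter and the column names are assumptions based on the publicly available bank-additional-full.csv layout:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("bank-eda").getOrCreate()

# The file is assumed to be semicolon-delimited, as in the public UCI bank marketing dataset.
df = (spark.read
      .option("header", "true")
      .option("inferSchema", "true")
      .option("sep", ";")
      .csv("bank-additional-full.csv"))

print(df.count())                                  # verify the number of records

# Quick summary statistics for a few numeric columns
df.describe("age", "duration", "campaign", "previous").show()

# Additional statistics from the stat package
print(df.stat.corr("age", "duration"))             # correlation between two columns
df.stat.crosstab("marital", "y").show()            # crosstab of marital status vs outcome

# Aggregate by whether a term deposit was subscribed (the "y" column)
(df.groupBy("y")
   .agg(F.count("*").alias("customers"),
        F.round(F.avg("campaign"), 2).alias("avg_calls"),
        F.round(F.avg("duration"), 2).alias("avg_duration"),
        F.round(F.avg("previous"), 2).alias("avg_prev_calls"))
   .show())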
These outliers may exert undue influence on the results of statistical analysis, so they should be identified using reliable detection methods prior to performing data analysis. Many algorithms find outliers as a side-product of clustering; however, these techniques define outliers as points that do not lie in clusters. The user has to model the data points using statistical distributions, and the outliers are identified depending on how they appear in relation to the underlying model. The main problem with these approaches is that during EDA the user typically does not have enough knowledge about the underlying data distribution. EDA, using a modeling and visualizing approach, is a good way of achieving a deeper intuition of our data. Spark MLlib supports a large (and growing) set of distributed machine learning algorithms to make this task simpler. In the following example, we use the k-means clustering algorithm to compute two clusters in our data. Other distributed algorithms useful for EDA include classification, regression, dimensionality reduction, correlation, and hypothesis testing.

Visualizing data with Apache Zeppelin

Typically, we will generate many graphs to verify our hunches about the data. A lot of these quick and dirty graphs used during EDA are, ultimately, discarded. Exploratory data visualization is critical for data analysis and modeling. However, we often skip exploratory visualization with large data because it is hard. For instance, browsers typically cannot handle millions of data points. Hence we have to summarize, sample, or model our data before we can effectively visualize it. Traditionally, BI tools provided extensive aggregation and pivoting features to visualize the data. However, these tools typically used nightly jobs to summarize large volumes of data. The summarized data was subsequently downloaded and visualized on the practitioner's workstation. Spark can eliminate many of these batch jobs to support interactive data visualization. Here, we will explore some basic data visualization techniques using Apache Zeppelin. Apache Zeppelin is a web-based tool that supports interactive data analysis and visualization. It supports several language interpreters and comes with built-in Spark integration. Hence, it is quick and easy to get started with exploratory data analysis using Apache Zeppelin. You can download Apache Zeppelin from https://zeppelin.apache.org/. Unzip the package on your hard drive and start Zeppelin using the following command:

Aurobindos-MacBook-Pro-2:zeppelin-0.6.2-bin-all aurobindosarkar$ bin/zeppelin-daemon.sh start

You should see the following message:

Zeppelin start    [ OK ]

You should be able to see the Zeppelin home page at http://localhost:8080/. Click on the Create new note link, and specify a path and name for your notebook, as shown in the following figure:

In the next step, we paste the same code as at the beginning of this article to create a DataFrame for our sample dataset. We can execute typical DataFrame operations, as shown in the following figure:

Next, we create a table from our DataFrame and execute some SQL on it. The results of the SQL statements' execution can be charted by clicking on the appropriate chart type required. Here, we create bar charts as an illustrative example of summarizing and visualizing data:

We can also plot a scatter plot, and read the coordinate values of each of the points plotted, as shown in the following two figures.
Additionally, we can create a textbox that accepts input values to make the experience interactive. In the following figure we create a textbox that can accept different values for the age parameter, and the bar chart is updated accordingly. Similarly, we can also create drop-down lists where the user can select the appropriate option, and the table of values or the chart automatically gets updated.

Summary

In this article, we demonstrated using Spark SQL for exploring datasets, performing basic data quality checks, generating samples and pivot tables, and visualizing data with Apache Zeppelin.
How to set up a Deep Learning System on Amazon Web Services (AWS)

Gebin George
07 Mar 2018
5 min read
[box type="note" align="" class="" width=""]This article is an excerpt from the book, Deep Learning Essentials written by Wei Di, Anurag Bhardwaj, and Jianing Wei. This book covers popular Python libraries such as Tensorflow, Keras, and more, along with tips to train, deploy and optimize deep learning models in the best possible manner.[/box] Today, we will learn two different methods of setting up a deep learning system using Amazon Web Services (AWS).

Setup from scratch

We will illustrate how to set up a deep learning environment on an AWS EC2 GPU instance g2.2xlarge running Ubuntu Server 16.04 LTS. For this example, we will use a pre-baked Amazon Machine Image (AMI), which already has a number of software packages installed, making it easier to set up an end-to-end deep learning system. We will use a publicly available AMI image, ami-b03ffedf, which has the following packages pre-installed:

CUDA 8.0
Anaconda 4.2.0 with Python 3
Keras / Theano

The first step to setting up the system is to set up an AWS account and spin up a new EC2 GPU instance using the AWS web console (http://console.aws.amazon.com/), as shown in the figure Choose EC2 AMI:
2. We pick a g2.2xlarge instance type from the next page, as shown in the figure Choose instance type:
3. After adding 30 GB of storage, as shown in the figure Choose storage, we now launch a cluster and assign an EC2 key pair that allows us to ssh and log in to the box using the provided key pair file:
4. Once the EC2 box is launched, the next step is to install the relevant software packages. To ensure proper GPU utilization, it is important to ensure graphics drivers are installed first. We will upgrade and install NVIDIA drivers as follows:

$ sudo add-apt-repository ppa:graphics-drivers/ppa -y
$ sudo apt-get update
$ sudo apt-get install -y nvidia-375 nvidia-settings

While the NVIDIA drivers ensure that the host GPU can now be utilized by any deep learning application, they do not provide an easy interface to application developers for easy programming on the device. Various software libraries exist today that help achieve this task reliably. Open Computing Language (OpenCL) and CUDA are more commonly used in industry. In this book, we use CUDA as an application programming interface for accessing NVIDIA graphics drivers. To install the CUDA driver, we first SSH into the EC2 instance, download CUDA 8.0 to our $HOME folder, and install it from there:

$ wget https://developer.nvidia.com/compute/cuda/8.0/Prod2/local_installers/cuda-repo-ubuntu1604-8-0-local-ga2_8.0.61-1_amd64-deb
$ sudo dpkg -i cuda-repo-ubuntu1604-8-0-local-ga2_8.0.61-1_amd64-deb
$ sudo apt-get update
$ sudo apt-get install -y cuda nvidia-cuda-toolkit

Once the installation is finished, you can run the following command to validate the installation:

$ nvidia-smi

Now your EC2 box is fully configured to be used for deep learning development. However, for someone who is not very familiar with deep learning implementation details, building a deep learning system from scratch can be a daunting task. To ease this development, a number of advanced deep learning software frameworks exist, such as Keras and Theano.
Both of these frameworks are based on a Python development environment, hence we first install a Python distribution on the box, such as Anaconda:

$ wget https://repo.continuum.io/archive/Anaconda3-4.2.0-Linux-x86_64.sh
$ bash Anaconda3-4.2.0-Linux-x86_64.sh

Finally, Keras and Theano are installed using Python's package manager, pip:

$ pip install --upgrade --no-deps git+git://github.com/Theano/Theano.git
$ pip install keras

Once the pip installation is completed successfully, the box is now fully set up for deep learning development.

Setup using Docker

The previous section describes getting started from scratch, which can be tricky sometimes given continuous changes to software packages and changing links on the web. One way to avoid dependence on links is to use container technology like Docker. In this chapter, we will use the official NVIDIA-Docker image that comes pre-packaged with all the necessary packages and deep learning frameworks to get you quickly started with deep learning application development:

$ sudo add-apt-repository ppa:graphics-drivers/ppa -y
$ sudo apt-get update
$ sudo apt-get install -y nvidia-375 nvidia-settings nvidia-modprobe

We now install Docker Community Edition as follows:

$ curl -fsSL https://download.docker.com/linux/ubuntu/gpg | sudo apt-key add -
# Verify that the key fingerprint is 9DC8 5822 9FC7 DD38 854A E2D8 8D81 803C 0EBF CD88
$ sudo apt-key fingerprint 0EBFCD88
$ sudo add-apt-repository "deb [arch=amd64] https://download.docker.com/linux/ubuntu $(lsb_release -cs) stable"
$ sudo apt-get update
$ sudo apt-get install -y docker-ce

2. We then install NVIDIA-Docker and its plugin:

$ wget -P /tmp https://github.com/NVIDIA/nvidia-docker/releases/download/v1.0.1/nvidia-docker_1.0.1-1_amd64.deb
$ sudo dpkg -i /tmp/nvidia-docker_1.0.1-1_amd64.deb && rm /tmp/nvidia-docker_1.0.1-1_amd64.deb

3. To validate whether the installation happened correctly, we use the following command:

$ sudo nvidia-docker run --rm nvidia/cuda nvidia-smi

4. Once it's set up correctly, we can use the official TensorFlow or Theano Docker image:

$ sudo nvidia-docker run -it tensorflow/tensorflow:latest-gpu bash

5. We can run a simple Python program to check if TensorFlow works properly:

import tensorflow as tf
a = tf.constant(5, tf.float32)
b = tf.constant(5, tf.float32)
with tf.Session() as sess:
    output = sess.run(tf.add(a, b))  # output is 10.0
    print("Output of graph computation is = ", output)

You should see the TensorFlow output on the screen now, as shown in the figure Tensorflow sample output: We saw how to set up a deep learning system on AWS from scratch and on Docker. If you found our post useful, do check out this book Deep Learning Essentials to optimize deep learning models for better performance output.
ASP.NET Core High Performance

Packt
07 Mar 2018
20 min read
In this article by James Singleton, the author of the book ASP.NET Core High Performance, we will see that many things have changed for version 2 of the ASP.NET Core framework and there have also been a lot of improvements to the various supporting technologies. Now is a great time to give it a try, as the code has stabilized and the pace of change has settled down a bit. There were significant differences between the original release candidate and version 1 of ASP.NET Core and yet more alterations between version 1 and version 2. Some of these changes have been controversial, particularly around tooling but the scope of .NET Core has grown massively and ultimately this is a good thing. One of the highest profile differences between 1 and 2 is the change (and some would say regression) away from the new JavaScript Object Notation (JSON) based project format and back towards the Extensible Markup Language (XML) based .csproj format. However, it is a simplified and stripped down version compared to the format used in the original .NET Framework. There has been a move towards standardization between the different .NET frameworks and .NET Core 2 has a much larger API surface as a result. The interface specification known as .NET Standard 2 covers the intersection between .NET Core, the .NET Framework, and Xamarin. There is also an effort to standardize Extensible Application Markup Language (XAML) into the XAML Standard that will work across Universal Windows Platform (UWP) and Xamarin.Forms apps. C# and .NET can be used on a huge amount of platforms and in a large number of use cases, from server side web applications to mobile apps and even games using engines like Unity 3D. In this article we will go over the changes between version 1 and version 2 of the new Core releases. We will also look at some new features of the C# language. There are many useful additions and a plethora of performance improvement too. In this article we will cover: .NET Core 2 scope increases ASP.NET Core 2 additions Performance improvements .NET Standard 2 New C# 6 features New C# 7 features JavaScript considerations New in Core 2 There are two different products in the Core family. The first is .NET Core, which is the low level framework providing basic libraries. This can be used to write console applications and it is also the foundation for higher level application frameworks. The second is ASP.NET Core, which is a framework for building web applications that run on a server and service clients (usually web browsers). This was originally the only workload for .NET Core until it grew in scope to handle a more diverse range of scenarios. We'll cover the differences in the new versions separately for each of these frameworks. The changes in .NET Core will also apply to ASP.NET Core, unless you are running it on top of the .NET Framework version 4. New in .NET Core 2 The main focus of .NET Core 2 is the huge increase in scope. There are more than double the number of APIs included and it supports .NET Standard 2 (covered later in this article). You can also refer .NET Framework assemblies with no recompile required. This should just work as long as the assemblies only use APIs that have been implemented in .NET Core. This means that more NuGet packages will work with .NET Core. Finding if your favorite library was supported or not, was always a challenge with the previous version. The author set up a repository listing package compatibility to help with this. 
You can find the ASP.NET Core Library and Framework Support (ANCLAFS) list at github.com/jpsingleton/ANCLAFS and also via anclafs.com. If you want to make a change then please send a pull request. Hopefully in the future all packages will support Core and this list will no longer be required. There is now support in Core for Visual Basic and for more Linux distributions. You can also perform live unit testing with Visual Studio 2017, much like the old NCrunch extension. Performance improvements Some of the more interesting changes for 2 are the performance improvements over the original .NET Framework. There have been tweaks to the implementations of many of the framework data structures. Some of the classes and methods that have seen speed improvements or memory reduction include: List<T> Queue<T> SortedSet<T> ConcurrentQueue<T> Lazy<T> Enumerable.Concat() Enumerable.OrderBy() Enumerable.ToList() Enumerable.ToArray() DeflateStream SHA256 BigInteger BinaryFormatter Regex WebUtility.UrlDecode() Encoding.UTF8.GetBytes() Enum.Parse() DateTime.ToString() String.IndexOf() String.StartsWith() FileStream Socket NetworkStream SslStream ThreadPool SpinLock We won't go into specific benchmarks here because benchmarking is hard and the improvements you see will clearly depend on your usage. The thing to take away is that lots of work has been done to increase the performance of .NET Core, both over the previous version 1 and .NET Framework 4.7. Many of these changes have come from the community, which shows one of the benefits of open source development. Some of these advances will probably work their way back into a future version of the regular .NET Framework too. There have also been improvements to the RyuJIT compiler for .NET Core 2. As just one example, finally blocks are now almost as efficient as not using exception handing at all, in the normal situation where no exceptions are thrown. You now have no excuses not to liberally use try and using blocks, for example by having checked arithmetic to avoid integer overflows. New in ASP.NET Core 2 ASP.NET Core 2 takes advantage of all the improvements to .NET Core 2, if that is what you choose to run it on. It will also run on .NET Framework 4.7 but it's best to run it on .NET Core, if you can. With the increase in scope and support of .NET Core 2 this should be less of a problem than it was previously. It includes a new meta package so you only need to reference one NuGet item to get all the things! However, it is still composed of individual packages if you want to pick and choose. They haven't reverted back to the bad old days of one huge System.Web assembly. A new package trimming feature will ensure that if you don't use a package then its binaries won't be included in your deployment, even if you use the meta package to reference it. There is also a sensible default for setting up a web host configuration. You don't need to add logging, Kestrel, and IIS individually anymore. Logging has also got simpler and, as it is built in, you have no excuses not to use it from the start. A new feature is support for controller-less Razor Pages. These are exactly what they sound like and allow you to write pages with just a Razor template. This is similar to the Web Pages product, not to be confused with Web Forms. There is talk of Web Forms making a comeback, but if so then hopefully the abstraction will be thought out more and it won't carry so much state around with it. There is a new authentication model that makes better use of Dependency Injection. 
ASP.NET Core Identity allows you to use OpenID, OAuth 2 and get access tokens for your APIs. A nice time saver is you no longer need to emit anti-forgery tokens in forms (to prevent Cross Site Request Forgery) with attributes to validate them on post methods. This is all done automatically for you, which should prevent you forgetting to do this and leaving a security vulnerability. Performance improvements There have been additional increases to performance in ASP.NET Core that are not related to the improvements in .NET Core, which also help. Startup time has been reduced by shipping binaries that have already been through the Just In Time compilation process. Although not a new feature in ASP.NET Core 2, output caching is now available. In 1.0, only response caching was included, which simply set the correct HTTP headers. In 1.1, an in-memory cache was added and today you can use local memory or a distributed cache kept in SQL Server or Redis. Standards Standards are important, that's why we have so many of them. The latest version of the .NET Standard is 2 and .NET Core 2 implements this. A good way to think about .NET Standard is as an interface that a class would implement. The interface defines an abstract API but the concrete implementation of that API is left up to the classes that inherit from it. Another way to think about it is like the HTML5 standard that is supported by different web browsers. Version 2 of the .NET Standard was defined by looking at the intersection of the .NET Framework and Mono. This standard was then implemented by .NET Core 2, which is why is contains so many more APIs than version 1. Version 4.6.1 of the .NET Framework also implements .NET Standard 2 and there is work to support the latest versions of the .NET Framework, UWP and Xamarin (including Xamarin.Forms). There is also the new XAML Standard that aims to find the common ground between Xamarin.Forms and UWP. Hopefully it will include Windows Presentation Foundation (WPF) in the future. If you create libraries and packages that use these standards then they will work on all the platforms that support them. As a developer simply consuming libraries, you don't need to worry about these standards. It just means that you are more likely to be able to use the packages that you want, on the platforms you are working with. New C# features It not just the frameworks and libraries that have been worked on. The underlying language has also had some nice new features added. We will focus on C# here as it is the most popular language for the Common Language Runtime (CLR). Other options include Visual Basic and the functional programming language F#. C# is a great language to work with, especially when compared to a language like JavaScript. Although JavaScript is great for many reasons (such as its ubiquity and the number of frameworks available), the elegance and design of the language is not one of them. Many of these new features are just syntactic sugar, which means they don't add any new functionality. They simply provide a more succinct and easier to read way of writing code that does the same thing. C# 6 Although the latest version of C# is 7, there are some very handy features in C# 6 that often go underused. Also, some of the new additions in 7 are improvements on features added in 6 and would not make much sense without context. We will quickly cover a few features of C# 6 here, in case you are unaware of how useful they can be. 
String interpolation String interpolation is a more elegant and easier to work with version of the familiar string format method. Instead of supplying the arguments to embed in the string placeholders separately, you can now embed them directly in the string. This is far more readable and less error prone. Let us demonstrate with an example. Consider the following code that embeds an exception in a string. catch (Exception e) { Console.WriteLine("Oh dear, oh dear! {0}", e); } This embeds the first (and in this case only) object in the string at the position marked by the zero. It may seem simple but this quickly gets complex if you have many objects and want to add another at the start. You then have to correctly renumber all the placeholders. Instead you can now prefix the string with a dollar character and embed the object directly in it. This is shown in the following code that behaves the same as the previous example. catch (Exception e) { Console.WriteLine($"Oh dear, oh dear! {e}"); } The ToString() method on an exception outputs all the required information including name, message, stack trace and any inner exceptions. There is no need to deconstruct it manually, you may even miss things if you do. You can also use the same format strings as you are used to. Consider the following code that formats a date in a custom manner. Console.WriteLine($"Starting at: {DateTimeOffset.UtcNow:yyyy/MM/ddHH:mm:ss}"); When this feature was being built, the syntax was slightly different. So be wary of any old blog posts or documentation that may not be correct. Null conditional The null conditional operator is a way of simplifying null checks. You can now inline a check for null rather than using an if statement or ternary operator. This makes it easier to use in more places and will hopefully help you to avoid the dreaded null reference exception. You can avoid doing a manual null check like in the following code. int? length = (null == bytes) ? null : (int?)bytes.Length; This can now be simplified to the following statement by adding a question mark. int? length = bytes?.Length; Exception filters You can filter exceptions more easily with the when keyword. You no longer need to catch every type of exception that you are interested in and then filter manually inside the catch block. This is a feature that was already present in VB and F# so it's nice that C# has finally caught up. There are some small benefits to this approach. For example, if your filter is not matched then the exception can still be caught by other catch blocks in the same try statement. You also don't need to remember to re-throw the exception to avoid it being swallowed. This helps with debugging, as Visual Studio will no longer break, as it would when you throw. For example, you could check to see if there is a message in the exception and handle it differently, as shown here. catch (Exception e) when (e?.Message?.Length> 0) When this feature was in development, a different keyword (if) was used. So be careful of any old information online. One thing to keep in mind is that relying on a particular exception message is fragile. If your application is localized then the message may be in a different language than what you expect. This holds true outside of exception filtering too. Asynchronous availability Another small improvement is that you can use the await keyword inside catch and finally blocks. This was not initially allowed when this incredibly useful feature was added in C# 5. There is not a lot more to say about this. 
The implementation is complex but you don't need to worry about this unless you're interested in the internals. From a developer point of view, it just works, as in this trivial example. catch (Exception e) when (e?.Message?.Length> 0) { await Task.Delay(200); } This feature has been improved in C# 7, so read on. You will see async and await used a lot. Asynchronous programming is a great way of improving performance and not just from within your C# code. Expression bodies Expression bodies allow you to assign an expression to a method or getter property using the lambda arrow operator (=>) that you may be familiar with from fluent LINQ syntax. You no longer need to provide a full statement or method signature and body. This feature has also been improved in C# 7 so see the examples in the next section. For example, a getter property can be implemented like so. public static string Text => $"Today: {DateTime.Now:o}"; A method can be written in a similar way, such as the following example. private byte[] GetBytes(string text) => Encoding.UTF8.GetBytes(text); C# 7 The most recent version of the C# language is 7 and there are yet more improvements to readability and ease of use. We'll cover a subset of the more interesting changes here. Literals There are couple of minor additional capabilities and readability enhancements when specifying literal values in code. You can specify binary literals, which means you don't have to work out how to represent them using a different base anymore. You can also put underscores anywhere within a literal to make it easier to read the number. The underscores are ignored but allow you to separate digits into convention groupings. This is particularly well suited to the new binary literal as it can be very verbose listing out all those zeros and ones. Take the following example using the new 0b prefix to specify a binary literal that will be rendered as an integer in a string. Console.WriteLine($"Binary solo! {0b0000001_00000011_000000111_00001111}"); You can do this with other bases too, such as this integer, which is formatted to use a thousands separator. Console.WriteLine($"Over {9_000:#,0}!"); // Prints "Over 9,000!" Tuples One of the big new features in C# 7 is support for tuples. Tuples are groups of values and you can now return them directly from method calls. You are no longer restricted to returning a single value. Previously you could work around this limitation in a few sub-optimal ways, including creating a custom complex object to return, perhaps with a Plain Old C# Object (POCO) or Data Transfer Object (DTO), which are the same thing. You could have also passed in a reference using the ref or out keywords, which although there are improvements to the syntax are still not great. There was System.Tuple in C# 6 but this wasn't ideal. It was a framework feature, rather than a language feature and the items were only numbered and not named. With C# 7 tuples, you can name the objects and they make a great alternative to anonymous types, particularly in LINQ query expression lambda functions. As an example, if you only want to work on a subset of the data available, perhaps when filtering a database table with an O/RM such as Entity Framework, then you could use a tuple for this. The following example returns a tuple from a method. You may need to add the System.ValueTupleNuGet package for this to work. 
private static (int one, string two, DateTime three) GetTuple() { return (one: 1, two: "too", three: DateTime.UtcNow); } You can also use tuples in string interpolation and all the values are rendered, as shown here. Console.WriteLine($"Tuple = {GetTuple()}"); Out variables If you did want to pass parameters into a method for modification then you have always needed to declare them first. This is no longer necessary and you can simply declare the variables at the point you pass them in. You can also declare a variable to be discarded by using an underscore. This is particularly useful if you don't want to use the returned value, for example in some of the try parse methods of the native framework data types. Here we parse a date without declaring the dt variable first. DateTime.TryParse("2017-08-09", out var dt); In this example we test for an integer but we don't care what it is. var isInt = int.TryParse("w00t", out _); References You can now return values by reference from a method as well as consume them. This is a little like working with pointers in C but safer. For example, you can only return references that were passed into the method and you can't modify references to point to a different location in memory. This is a very specialist feature but in certain niche situations it can dramatically improve performance. Given the following method. private static ref string GetFirstRef(ref string[] texts) { if (texts?.Length> 0) { return ref texts[0]; } throw new ArgumentOutOfRangeException(); } You could call it like so, and the second console output line would appear differently (one instead of 1). var strings = new string[] { "1", "2" }; ref var first = ref GetFirstRef(ref strings); Console.WriteLine($"{strings?[0]}"); // 1 first = "one"; Console.WriteLine($"{strings?[0]}"); // one Patterns The other big addition is you can now match patterns in C# 7 using the is keyword. This simplifies testing for null and matching against types, among other things. It also lets you easily use the cast value. This is a simpler alternative to using full polymorphism (where a derived class can be treated as a base class and override methods). However, if you control the code base and can make use of proper polymorphism, then you should still do this and follow good Object-Oriented Programming (OOP) principles. In the following example, pattern matching is used to parse the type and value of an unknown object. private static int PatternMatch(object obj) { if (obj is null) { return 0; } if (obj is int i) { return i++; } if (obj is DateTime d || (obj is string str && DateTime.TryParse(str, out d))) { return d.DayOfYear; } return -1; } You can also use pattern matching in the cases of a switch statement and you can switch on non-primitive types such as custom objects. More expression bodies Expression bodies are expanded from the offering in C# 6 and you can now use them in more places, for example as object constructors and property setters. Here we extend our previous example to include setting the value on the property we were previously just reading. private static string text; public static string Text { get => text ?? $"Today: {DateTime.Now:r}"; set => text = value; } More asynchronous improvements There have been some small improvements to what async methods can return and, although small, they could offer big performance gains in certain situations. You no longer have to return a task, which can be beneficial if the value is already available. 
This can reduce the overheads of using async methods and creating a task object. JavaScript You can't write a book on web applications without covering JavaScript. It is everywhere. If you write a web app that does a full page load on every request and it's not a simple content site then it will feel slow. Users expect responsiveness. If you are a back-end developer then you may think that you don't have to worry about this. However, if you are building an API then you may want to make it easy to consume with JavaScript and you will need to make sure that your JSON is correctly and quickly serialized. Even if you are building a Single Page Application (SPA) in JavaScript (or TypeScript) that runs in the browser, the server can still play a key role. You can use SPA services to run Angular or React on the server and generate the initial output. This can increase performance, as the browser has something to render immediately. For example, there is a project called React.NET that integrates React with ASP.NET, and it supports ASP.NET Core. If you have been struggling to keep up with the latest developments in the .NET world then JavaScript is on another level. There seems to be something new almost every week and this can lead to framework fatigue and the paradox of choice. There is so much to choose from that you don't know what to pick. Summary In this article, you have seen a brief high-level summary of what has changed in .NET Core 2 and ASP.NET Core 2, compared to the previous versions. You are also now aware of .NET Standard 2 and what it is for. We have shown examples of some of the new features available in C# 6 and C# 7. These can be very useful in letting you write more with less, and in making your code more readable and easier to maintain.
Working with Forensic Evidence Container Recipes

Packt
07 Mar 2018
13 min read
In this article by Preston Miller and Chapin Bryce, authors of Learning Python for Forensics, we introduce a recipe from our upcoming book, Python Digital Forensics Cookbook. In Python Digital Forensics Cookbook, each chapter comprises many scripts, or recipes, falling under specific themes. The "Iterating Through Files" recipe shown here is from our chapter that introduces the Sleuth Kit's Python bindings, pytsk3, and other libraries, to programmatically interact with forensic evidence containers. Specifically, this recipe shows how to access a forensic evidence container and iterate through all of its files to create an active file listing of its contents. (For more resources related to this topic, see here.) If you are reading this article, it goes without saying that Python is a key tool in DFIR investigations. However, most examiners are not familiar with, or do not take advantage of, the Sleuth Kit's Python bindings. Imagine being able to run your existing scripts against forensic containers without needing to mount them or export loose files. This recipe continues to introduce the library, pytsk3, that will allow us to do just that and take our scripting capabilities to the next level. In this recipe, we learn how to recurse through the filesystem and create an active file listing. Oftentimes, one of the first questions we, as the forensic examiner, are asked is "What data is on the device?". An active file listing comes in handy here. Creating a file listing of loose files is a very straightforward task in Python. However, this will be slightly more complicated because we are working with a forensic image rather than loose files. This recipe will be a cornerstone for future scripts as it will allow us to recursively access and process every file in the image. As we continue to introduce new concepts and features from the Sleuth Kit, we will add new functionality to our previous recipes in an iterative process. In a similar way, this recipe will become integral in future recipes to iterate through directories and process files.

Getting started

Refer to the Getting started section in the Opening Acquisitions recipe for information on the build environment and setup details for pytsk3 and pyewf. All other libraries used in this script are present in Python's standard library.

How to do it...

We perform the following steps in this recipe:

Import argparse, csv, datetime, os, pytsk3, pyewf, and sys;
Identify if the evidence container is a raw (DD) image or an EWF (E01) container;
Access the forensic image using pytsk3;
Recurse through all directories in each partition;
Store file metadata in a list;
And write the active file list to a CSV.

How it works...

This recipe's command-line handler takes three positional arguments: EVIDENCE_FILE, TYPE, and OUTPUT_CSV, which represent the path to the evidence file, the type of evidence file, and the output CSV file, respectively. Similar to the previous recipe, the optional p switch can be supplied to specify a partition type. We use the os.path.dirname() method to extract the desired output directory path for the CSV file and, with the os.makedirs() function, create the necessary output directories if they do not exist.
if __name__ == '__main__': # Command-line Argument Parser parser = argparse.ArgumentParser() parser.add_argument("EVIDENCE_FILE", help="Evidence file path") parser.add_argument("TYPE", help="Type of Evidence", choices=("raw", "ewf")) parser.add_argument("OUTPUT_CSV", help="Output CSV with lookup results") parser.add_argument("-p", help="Partition Type", choices=("DOS", "GPT", "MAC", "SUN")) args = parser.parse_args() directory = os.path.dirname(args.OUTPUT_CSV) if not os.path.exists(directory) and directory != "": os.makedirs(directory) Once we have validated the input evidence file by checking that it exists and is a file, the four arguments are passed to the main() function. If there is an issue with initial validation of the input, an error is printed to the console before the script exits. if os.path.exists(args.EVIDENCE_FILE) and os.path.isfile(args.EVIDENCE_FILE): main(args.EVIDENCE_FILE, args.TYPE, args.OUTPUT_CSV, args.p) else: print("[-] Supplied input file {} does not exist or is not a file".format(args.EVIDENCE_FILE)) sys.exit(1) In the main() function, we instantiate the volume variable with None to avoid errors referencing it later in the script. After printing a status message to the console, we check if the evidence type is an E01 to properly process it and create a valid pyewf handle as demonstrated in more detail in the Opening Acquisitions recipe. Refer to that recipe for more details as this part of the function is identical. The end result is the creation of the pytsk3 handle, img_info, for the user supplied evidence file. def main(image, img_type, output, part_type): volume = None print "[+] Opening {}".format(image) if img_type == "ewf": try: filenames = pyewf.glob(image) except IOError, e: print "[-] Invalid EWF format:n {}".format(e) sys.exit(2) ewf_handle = pyewf.handle() ewf_handle.open(filenames) # Open PYTSK3 handle on EWF Image img_info = ewf_Img_Info(ewf_handle) else: img_info = pytsk3.Img_Info(image) Next, we attempt to access the volume of the image using the pytsk3.Volume_Info() method by supplying it the image handle. If the partition type argument was supplied, we add its attribute ID as the second argument. If we receive an IOError when attempting to access the volume, we catch the exception as e and print it to the console. Notice, however, that we do not exit the script as we often do when we receive an error. We'll explain why in the next function. Ultimately, we pass the volume, img_info, and output variables to the openFS() method. try: if part_type is not None: attr_id = getattr(pytsk3, "TSK_VS_TYPE_" + part_type) volume = pytsk3.Volume_Info(img_info, attr_id) else: volume = pytsk3.Volume_Info(img_info) except IOError, e: print "[-] Unable to read partition table:n {}".format(e) openFS(volume, img_info, output) The openFS() method tries to access the filesystem of the container in two ways. If the volume variable is not None, it iterates through each partition, and if that partition meets certain criteria, attempts to open it. If, however, the volume variable is None, it instead tries to directly call the pytsk3.FS_Info() method on the image handle, img. As we saw, this latter method will work and give us filesystem access for logical images whereas the former works for physical images. Let's look at the differences between these two methods. Regardless of the method, we create a recursed_data list to hold our active file metadata. 
In the first instance, where we have a physical image, we iterate through each partition and check that is it greater than 2,048 sectors and does not contain the words "Unallocated", "Extended", or "Primary Table" in its description. For partitions meeting these criteria, we attempt to access its filesystem using the FS_Info() function by supplying the pytsk3 img object and the offset of the partition in bytes. If we are able to access the filesystem, we use to open_dir() method to get the root directory and pass that, along with the partition address ID, the filesystem object, two empty lists, and an empty string, to the recurseFiles() method. These empty lists and string will come into play in recursive calls to this function as we will see shortly. Once the recurseFiles() method returns, we append the active file metadata to the recursed_data list. We repeat this process for each partition def openFS(vol, img, output): print "[+] Recursing through files.." recursed_data = [] # Open FS and Recurse if vol is not None: for part in vol: if part.len > 2048 and "Unallocated" not in part.desc and "Extended" not in part.desc and "Primary Table" not in part.desc: try: fs = pytsk3.FS_Info(img, offset=part.start*vol.info.block_size) except IOError, e: print "[-] Unable to open FS:n {}".format(e) root = fs.open_dir(path="/") data = recurseFiles(part.addr, fs, root, [], [], [""]) recursed_data.append(data) We employ a similar method for the second instance, where we have a logical image, where the volume is None. In this case, we attempt to directly access the filesystem and, if successful, we pass that to the recurseFiles() method and append the returned data to our recursed_data list. Once we have our active file list, we send it and the user supplied output file path to the csvWriter() method. Let's dive into the recurseFiles() method which is the meat of this recipe. else: try: fs = pytsk3.FS_Info(img) except IOError, e: print "[-] Unable to open FS:n {}".format(e) root = fs.open_dir(path="/") data = recurseFiles(1, fs, root, [], [], [""]) recursed_data.append(data) csvWriter(recursed_data, output) The recurseFiles() function is based on an example of the FLS tool (https://github.com/py4n6/pytsk/blob/master/examples/fls.py) and David Cowen's Automating DFIR series tool dfirwizard (https://github.com/dlcowen/dfirwizard/blob/master/dfirwiza rd-v9.py). To start this function, we append the root directory inode to the dirs list. This list is used later to avoid unending loops. Next, we begin to loop through each object in the root directory and check that it has certain attributes we would expect and that its name is not either "." or "..". def recurseFiles(part, fs, root_dir, dirs, data, parent): dirs.append(root_dir.info.fs_file.meta.addr) for fs_object in root_dir: # Skip ".", ".." or directory entries without a name. if not hasattr(fs_object, "info") or not hasattr(fs_object.info, "name") or not hasattr(fs_object.info.name, "name") or fs_object.info.name.name in [".", ".."]: continue If the object passes that test, we extract its name using the info.name.name attribute. Next, we use the parent variable, which was supplied as one of the function's inputs, to manually create the file path for this object. There is no built-in method or attribute to do this automatically for us. We then check if the file is a directory or not and set the f_type variable to the appropriate type. If the object is a file, and it has an extension, we extract it and store it in the file_ext variable. 
If we encounter an AttributeError when attempting to extract this data we continue onto the next object. try: file_name = fs_object.info.name.name file_path = "{}/{}".format("/".join(parent), fs_object.info.name.name) try: if fs_object.info.meta.type == pytsk3.TSK_FS_META_TYPE_DIR: f_type = "DIR" file_ext = "" else: f_type = "FILE" if "." in file_name: file_ext = file_name.rsplit(".")[-1].lower() else: file_ext = "" except AttributeError: continue We create variables for the object size and timestamps. However, notice that we pass the dates to a convertTime() method. This function exists to convert the UNIX timestamps into a human-readable format. With these attributes extracted, we append them to the data list using the partition address ID to ensure we keep track of which partition the object is from size = fs_object.info.meta.size create = convertTime(fs_object.info.meta.crtime) change = convertTime(fs_object.info.meta.ctime) modify = convertTime(fs_object.info.meta.mtime) data.append(["PARTITION {}".format(part), file_name, file_ext, f_type, create, change, modify, size, file_path]) If the object is a directory, we need to recurse through it to access all of its sub-directories and files. To accomplish this, we append the directory name to the parent list. Then, we create a directory object using the as_directory() method. We use the inode here, which is for all intents and purposes a unique number and check that the inode is not already in the dirs list. If that were the case, then we would not process this directory as it would have already been processed. If the directory needs to be processed, we call the recurseFiles() method on the new sub_directory and pass it current dirs, data, and parent variables. Once we have processed a given directory, we pop that directory from the parent list. Failing to do this will result in false file path details as all of the former directories will continue to be referenced in the path unless removed. Most of this function was under a large try-except block. We pass on any IOError exception generated during this process. Once we have iterated through all of the subdirectories, we return the data list to the openFS() function. if f_type == "DIR": parent.append(fs_object.info.name.name) sub_directory = fs_object.as_directory() inode = fs_object.info.meta.addr # This ensures that we don't recurse into a directory # above the current level and thus avoid circular loops. if inode not in dirs: recurseFiles(part, fs, sub_directory, dirs, data, parent) parent.pop(-1) except IOError: pass dirs.pop(-1) return data Let's briefly look at the convertTime() function. We've seen this type of function before, if the UNIX timestamp is not 0, we use the datetime.utcfromtimestamp() method to convert the timestamp into a human-readable format. def convertTime(ts): if str(ts) == "0": return "" return datetime.utcfromtimestamp(ts) With the active file listing data in hand, we are now ready to write it to a CSV file using the csvWriter() method. If we did find data (i.e., the list is not empty), we open the output CSV file, write the headers, and loop through each list in the data variable. We use the csvwriterows() method to write each nested list structure to the CSV file. 
def csvWriter(data, output):
    if data == []:
        print "[-] No output results to write"
        sys.exit(3)
    print "[+] Writing output to {}".format(output)
    with open(output, "wb") as csvfile:
        csv_writer = csv.writer(csvfile)
        headers = ["Partition", "File", "File Ext", "File Type", "Create Date", "Change Date", "Modify Date", "Size", "File Path"]
        csv_writer.writerow(headers)
        for result_list in data:
            csv_writer.writerows(result_list)

Note that the header order matches the order in which the values were appended in recurseFiles() (create, change, modify). When opened in a spreadsheet application, the output CSV shows the type of data this recipe extracts from forensic images.

There's more...
For this recipe, there are a number of improvements that could further increase its utility:
Use tqdm, or another library, to create a progress bar to inform the user of the current execution progress.
Learn about the additional metadata values that can be extracted from filesystem objects using pytsk3 and add them to the output CSV file.

Summary
In summary, we have learned how to use pytsk3 to recursively iterate through any filesystem supported by the Sleuth Kit. This forms the basis of how we can use the Sleuth Kit to programmatically process forensic acquisitions. With this recipe in place, we will be able to interact further with these files in future recipes.
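To see how these pieces might be driven end to end, here is a minimal sketch, assuming a raw (dd) image and assuming the functions from this recipe (openFS(), recurseFiles(), convertTime(), and csvWriter()) are already defined in the same script. The image path and output name are placeholders, and E01 containers would additionally require pyewf, which this sketch does not cover.

import pytsk3

image_path = "evidence.dd"       # hypothetical raw image path
output_csv = "file_listing.csv"  # hypothetical output file

img = pytsk3.Img_Info(image_path)    # open the raw image
try:
    vol = pytsk3.Volume_Info(img)    # physical image: read the partition table
except IOError:
    vol = None                       # logical image: no volume system present

openFS(vol, img, output_csv)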

article-image-administering-arcgis-enterprise-through-rest-administrative-directories
Chad Cooper
07 Mar 2018
8 min read
Save for later

Administering ARCGIS Enterprise through the REST administrative directories

Chad Cooper
07 Mar 2018
8 min read
This is a guest post written by Chad Cooper. Chad has worked in the geospatial industry over the last 15 years as a technician, analyst, and developer, pertaining to state and local government, oil and gas, and academia. He is also the author of the title Mastering ArcGIS Enterprise Administration, which aims to help you learn to install configure, secure, and fully utilize ArcGIS Enterprise system. ArcGIS Enterprise is one of the most widely used GIS packages in the world. With the 10.5 release, Portal for ArcGIS became a first-class citizen, living alongside ArcGIS Server and playing a major role in management and administration of the web GIS. Data Store for ArcGIS allows for local storage of hosted feature services and is also a major player in the ArcGIS Enterprise ecosystem. The ArcGIS Web Adaptor completes ArcGIS Enterprise and is the fourth major component. These components are new to most users (Portal and Data Store), and they come with an increased level of configuration, complexity and administration. Luckily, there are many ways to administer and manage the ArcGIS Enterprise system. In this article, we will look at a few of those methods. How to access the ArcGIS server REST administrator directory ArcGIS Server exposes its functionality through web services using REST. With this architecture comes the ArcGIS Server REST Application Programming Interface, or API, that, in addition to exposing ArcGIS Server services, exposes every administrative task that ArcGIS Server supports. In the API, ArcGIS Server administrative tasks are considered resources and are accessed through URLs (which are Uniform Resource Locators, after all). Operations act on these resources and update their information or state. Resources and their operations are hierarchical and standardized and have unique URLs. Like the web, the REST API is stateless, meaning that it does not retain information from one request to another by either the sender or receiver. Each request that is sent is expected to contain all the necessary information to process that request. If it does, the server processes the request and sends back a well-defined response. As it is accessed over the web, the ArcGIS Server REST API can also be invoked from any language that can make a web service call, such as Python. Accessing the ArcGIS Server Administrator Directory can be done in several ways, depending upon your Web Adaptor configuration. From the ArcGIS Server machine, the Server Administrator Directory can be accessed at https://localhost:6443/arcgis/admin. There is no shortcut to this URL in the Windows Start menu. From another machine on the internal network, the Server Administrator Directory can be accessed by using the fully qualified domain name, or FQDN, instead of localhost, such as https://server.domain.com:6443/arcgis/admin. If, during your Web Adaptor configuration, you chose to Enable administrative access to your site through the Web Adaptor, you also will be able to access the Server Administrator Directory through your Web Adaptor URL, such as https://www.masteringageadmin.com/arcgis/admin. As with Server Manager, you will login as the primary site administrator (PSA) designated during installation or with other administrator credentials. Prior to ArcGIS 10.1, server configuration was held in plain text configuration files in the configuration store. These files are no longer part of the ArcGIS Server architecture. The ArcGIS Server REST Administrator Directory now exposes these settings. 
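Because every administrative resource is just an HTTP endpoint, the Server Administrator Directory can also be scripted, for example from Python. The snippet below is only a sketch: the host name and credentials are placeholders, it disables certificate verification for brevity, and the generateToken parameters shown (client=requestip, f=json) follow the commonly documented form of that operation, so confirm them against your server's version of the API.

import requests

server_admin = "https://server.domain.com:6443/arcgis/admin"  # placeholder host

# Request an administrative token from the generateToken operation
payload = {
    "username": "siteadmin",   # placeholder credentials
    "password": "secret",
    "client": "requestip",
    "f": "json",
}
resp = requests.post(server_admin + "/generateToken", data=payload, verify=False)
token = resp.json()["token"]

# Use the token to read a resource, for example the services root
services = requests.get(server_admin + "/services",
                        params={"token": token, "f": "json"}, verify=False)
print(services.json())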
How to use the ArcGIS Server REST Administrator Directory
The ArcGIS Server REST Administrator Directory, or "REST Admin" as it will be referred to herein, is a powerful way to manage all aspects of ArcGIS Server administration, as it exposes every administrative task that ArcGIS Server supports. Remember from earlier that the API is organized into resources and operations. Resources are settings within ArcGIS Server, and operations act on those resources to update their information or change their well-defined state, usually through an HTTP GET or POST method. HTTP GET requests data from a resource, while HTTP POST submits data to be processed to a resource. In other words, GET retrieves data and POST inserts/updates data. An example of a resource is a service. An existing service can have a well-defined state of stopped or started; it must be one or the other. Operations available on the service resource in the REST API include Start Service, Stop Service, Edit Service, and Delete Service. The Start, Stop, and Delete operations change the state of the service (from stopped to started, from started to stopped, and from either state to deleted; technically, if the service is started, it is first stopped before it is deleted). The Edit Service operation changes the information in the resource. Resources can also have child resources, which can in turn have their own set of operations and child resources. Remember that the API is hierarchical, so, for example, a service resource has the child resource Item Information, which has the Edit Item Information operation. To get to this operation in the REST Admin, we would log in to the REST Admin and go to services | SampleWorldCities.MapServer | iteminfo | edit, which would resemble the following in URL form:

https://www.masteringageadmin.com/arcgis/admin/services/SampleWorldCities.MapServer/iteminfo/edit

In the REST Admin, we could now edit the service Description, Summary, Tags, and Thumbnail. By updating the Item Information in the above example and clicking the Update button, we would be sending an edit HTTP POST operation to the https://www.masteringageadmin.com/arcgis/admin/services/SampleWorldCities.MapServer/iteminfo resource. The ArcGIS Server Manager equivalent for this process would be to go to Services | Manage Services | the Edit Service pencil button to the right of the service name | Item Description. Hopefully this gives you a better understanding of how the REST API works and how actions carried out in Server Manager and Server are executed by the API on the backend.

Administering Portal for ArcGIS through the Portal REST administrative directory
Just like ArcGIS Server, Portal has a REST backend from which all administrative tasks can be performed. We previously covered how the web interface for ArcGIS Server is a frontend to the ArcGIS Server REST API, and Portal is no different. We also covered services and how REST calls are made to the API.
The Portal Administrative Directory, referred to herein as "Portal Admin", can be accessed from within the internal network (bypassing the Web Adaptor) at a URL such as:

https://<FQDN>:7443/arcgis/portaladmin/

If administrative access is enabled on the Portal Web Adaptor, then we can access Portal Admin outside of our internal network at the Web Adaptor URL, such as:

https://www.your-domain.com/portal/portaladmin/

To log in to Portal Admin as an administrator, enter the Username and Password of an account with administrator privileges at the Portal Administrative Directory Login page and click the Login button. Let's now look at one administrative action that can be performed in the Portal REST Admin.

Portal licensing
Information on current Portal licensing can be viewed by going to Home | System | Licenses. Here, information on the validity and expiration of licensing and on registered members can be viewed. The Import Entitlements operation allows for the import of entitlements for ArcGIS Pro and additional products such as Business Analyst or Insights. For ArcGIS Pro, the operation requires an entitlements file exported out of My Esri. Once the entitlements have been imported, licenses can be assigned to users within Portal. Entitlements can have parts that are effective immediately and parts that become effective on a certain date. These all get imported, with the effective parts available immediately and the non-effective parts placed into a queue that Portal will automatically apply once they become effective. To import entitlements for ArcGIS Pro, do the following:

1. Have your entitlements file ready.
2. In Portal Admin, go to Home | System | Licenses | Import Entitlements.
3. Choose your entitlements file under Choose File.
4. For Application, choose ArcGISPro.
5. For Format, choose JSON or HTML (this is only the response format).
6. Click Import.

Once the entitlements are imported, the licenses can be assigned to users in Portal under My Organization | Manage Licenses. At its latest release, ArcGIS Enterprise has more components than ever before, resulting in additional setup, configuration, administration, and management requirements. Here, we looked at several ways to access the ArcGIS Server and Portal for ArcGIS REST administrative interfaces. These are a few of the many methods available to interact with your ArcGIS Enterprise system. Check out Mastering ArcGIS Enterprise Administration to learn how to administer ArcGIS Server, Portal, and Data Store through user interfaces, the REST API, and Python scripts.
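Since the chapter closes by pointing at Python scripting, here is a hedged sketch of reading the licenses resource discussed above over HTTP. The URL mirrors the Home | System | Licenses path, the token acquisition (normally performed against the Portal sharing API) is stubbed out, and the exact endpoint path and response fields should be verified against your Portal version.

import requests

portal_admin = "https://www.your-domain.com/portal/portaladmin"  # placeholder URL
token = "<administrator token obtained from the Portal sharing API>"  # stubbed for brevity

# Read current licensing information (Home | System | Licenses)
resp = requests.get(portal_admin + "/system/licenses",
                    params={"token": token, "f": "json"}, verify=False)
print(resp.json())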

article-image-implementing-matrix-operations-using-scipy-numpy
Pravin Dhandre
07 Mar 2018
5 min read
Save for later

Implementing matrix operations using SciPy and NumPy

Pravin Dhandre
07 Mar 2018
5 min read
[box type="note" align="" class="" width=""]This article is an excerpt from a book co-authored by L. Felipe Martins, Ruben Oliva Ramos and V Kishore Ayyadevara titled SciPy Recipes. This book includes hands-on recipes for using different components of the SciPy Stack such as NumPy, SciPy, matplotlib, pandas, etc.[/box] In this article, we will discuss how to leverage the power of SciPy and NumPy to perform numerous matrix operations and solve common challenges faced while proceeding with statistical analysis. Matrix operations and functions on two-dimensional arrays Basic matrix operations form the backbone of quite a few statistical analyses—for example, neural networks. In this section, we will be covering some of the most used operations and functions on 2D arrays: Addition Multiplication by scalar Matrix arithmetic Matrix-matrix multiplication Matrix inversion Matrix transposition In the following sections, we will look into the methods of implementing each of them in Python using SciPy/NumPy. How to do it… Let's look at the different methods. Matrix addition In order to understand how matrix addition is done, we will first initialize two arrays: # Initializing an array x = np.array([[1, 1], [2, 2]]) y = np.array([[10, 10], [20, 20]]) Similar to what we saw in a previous chapter, we initialize a 2 x 2 array by using the np.array function. There are two methods by which we can add two arrays. Method 1 A simple addition of the two arrays x and y can be performed as follows: x+y Note that x evaluates to: [[1 1] [2 2]] y evaluates to: [[10 10] [20 20]] The result of x+y would be equal to: [[1+10 1+10] [2+20 2+20]] Finally, this gets evaluated to: [[11 11] [22 22]] Method 2 The same preceding operation can also be performed by using the add function in the numpy package as follows: np.add(x,y) Multiplication by a scalar Matrix multiplication by a scalar can be performed by multiplying the vector with a number. We will perform the same using the following two steps: Initialize a two-dimensional array. Multiply the two-dimensional array with a scalar. We perform the steps, as follows: To initialize a two-dimensional array: x = np.array([[1, 1], [2, 2]]) To multiply the two-dimensional array with the k scalar: k*x For example, if the scalar value k = 2, then the value of k*x translates to: 2*x array([[2, 2], [4, 4]]) Matrix arithmetic Standard arithmetic operators can be performed on top of NumPy arrays too. The operations used most often are: Addition Subtraction Multiplication Division Exponentials The other major arithmetic operations are similar to the addition operation we performed on two matrices in the Matrix addition section earlier: # subtraction x-y array([[ -9, -9], [-18, -18]]) # multiplication x*y array([[10, 10], [40, 40]]) While performing multiplication here, there is an element to element multiplication between the two matrices and not a matrix multiplication (more on matrix multiplication in the next section): # division x/y array([[ 0.1, 0.1], [ 0.1, 0.1]]) # exponential x**y array([[ 1, 1], [1048576, 1048576]], dtype=int32) Matrix-matrix multiplication Matrix to matrix multiplication works in the following way: We have a set of two matrices with the following shape: Matrix A has n rows and m columns and matrix B has m rows and p columns. 
The matrix multiplication of A and B is then the n x p matrix whose (i, j) entry is the sum over k of A[i, k] * B[k, j]. The matrix operation is performed by using the built-in dot function available in NumPy as follows.

Initialize the arrays:

x = np.array([[1, 1], [2, 2]])
y = np.array([[10, 10], [20, 20]])

Perform the matrix multiplication using the dot function in the numpy package:

np.dot(x, y)
array([[30, 30],
       [60, 60]])

The np.dot function does the multiplication in the following way:

array([[1*10 + 1*20, 1*10 + 1*20],
       [2*10 + 2*20, 2*10 + 2*20]])

Whenever matrix multiplication happens, the number of columns in the first matrix must equal the number of rows in the second matrix.

Matrix transposition
Matrix transposition is performed by using the transpose function available in the numpy package. The process to generate the transpose of a matrix is as follows.

Initialize a matrix:

A = np.array([[1, 2], [3, 4]])

Calculate the transpose of the matrix:

A.transpose()
array([[1, 3],
       [2, 4]])

The transpose of a matrix with m rows and n columns is a matrix with n rows and m columns.

Matrix inversion
While we performed most of the basic arithmetic operations on top of matrices earlier, we have not performed any specialist functions from scientific computing/analysis, for example, matrix inversion, transposition, ranking of a matrix, and so on. This is where the other functions available within the scipy package shine (over and above the previously discussed functions), in scenarios where more data manipulation is required apart from the standard operations.

Matrix inversion can be performed by using the function available in scipy.linalg. The process to perform matrix inversion and its implementation in Python is as follows.

Import the relevant classes/functions from the package:

from scipy import linalg

Initialize a matrix:

A = np.array([[1, 2], [3, 4]])

Pass the initialized matrix through the inverse function in the linalg package:

linalg.inv(A)
array([[-2. ,  1. ],
       [ 1.5, -0.5]])

We saw how easily we can implement all the basic matrix operations with Python's scientific library, SciPy. You may check out this book SciPy Recipes to perform advanced computing tasks like Discrete Fourier Transform and K-means with the SciPy stack.
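A quick way to tie these operations together is to verify, numerically, that a matrix multiplied by its inverse yields the identity matrix. This short check uses only the functions shown above.

import numpy as np
from scipy import linalg

A = np.array([[1, 2], [3, 4]])
A_inv = linalg.inv(A)

# A times its inverse should be (numerically) the 2 x 2 identity matrix
print(np.allclose(np.dot(A, A_inv), np.eye(2)))  # True

# Transposing twice returns the original matrix
print(np.array_equal(A.transpose().transpose(), A))  # True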
article-image-introduction-aspnet-core-web-api
Packt
07 Mar 2018
13 min read
Save for later

Introduction to ASP.NET Core Web API

Packt
07 Mar 2018
13 min read
In this article by MithunPattankarand MalendraHurbuns, the authors of the book, Mastering ASP.NET Web API,we will start with a quick recap of MVC. We will be looking at the following topics:  Quick recap of MVC framework  Why Web APIs were incepted and it's evolution?  Introduction to .NET Core?  Overview of ASP.NET Core architecture (For more resources related to this topic, see here.) Quick recap of MVC framework Model-View-Controller (MVC) is a powerful and elegant way of separating concerns within an application and applies itself extremely well to web applications. With ASP.NETMVC, it's translated roughly as follows: Models (M): These are the classes that represent the domain you are interested in. These domain objects often encapsulate data stored in a database as well as code that manipulates the data and enforces domain-specific business logic. With ASP.NETMVC, this is most likely a Data Access Layer of some kind, using a tool like Entity Framework or NHibernate or classic ADO.NET.  View (V): This is a template to dynamically generate HTML.  Controller(C): This is a special class that manages the relationship between the View and the Model. It responds to user input, talks to the Model, and decides which view to render (if any). In ASP.NETMVC, this class is conventionally denoted by the suffix Controller. Why Web APIs were incepted and it's evolution? Looking back to days when ASP.NETASMX-based XML web service was widely used for building service-oriented applications, it was easiest way to create SOAP-based service which can be used by both .NET applications and non .NET applications. It was available only over HTTP. Around 2006, Microsoft released Windows Communication Foundation (WCF).WCF was and even now a powerful technology for building SOA-based applications. It was giant leap in the world of Microsoft .NET world. WCF was flexible enough to be configured as HTTP service, Remoting service, TCP service, and so on. Using Contracts of WCF, we would keep entire business logic code base same and expose the service as HTTP based or non HTTP based via SOAP/ non SOAP. Until 2010 the ASMX based XML web service or WCF service were widely used in client server based applications, in-fact everything was running smoothly. But the developers of .NET or non .NET community started to feel need for completely new SOA technology for client server applications. Some of reasons behind them were as follows: With applications in production, the amount of data while communicating started to explode and transferring them over the network was bandwidth consuming. SOAP being light weight to some extent started to show signs of payload increase. A few KB SOAP packets were becoming few MBs of data transfer.  Consuming the SOAP service in applications lead to huge applications size because of WSDL and proxy generation. This was even worse when it was used in web applications. Any changes to SOAP services lead to repeat of consuming them by proxy generation. This wasn't easy task for any developers.  JavaScript-based web frameworks were getting released and gaining ground for much simpler way of web development. Consuming SOAP-based services were not that optimal way. Hand-held devices were becoming popular like tablets, smartphones. They had more focused applications and needed very lightweight service oriented approach.  Browser based Single Page Applications (SPA) was gaining ground very rapidly. Using SOAP based services for quite heavy for these SPA. 
Microsoft released REST based WCF components which can be configured to respond in JSON or XML, but even then it was WCF which was heavy technology to be used.  Applications where no longer just large enterprise services, but there was need was more focused light weight service to be up & running in few days and much easier to use. Any developer who has seen evolving nature of SOA based technologies like ASMX, WCF or any SOAP based felt the need to have much lighter, HTTP based services. HTTP only, JSON compatible POCO based lightweight services was need of the hour and concept of Web API started gaining momentum. What is Web API? A Web API is a programmatic interface to a system that is accessed via standard HTTP methods and headers. A Web API can be accessed by a variety of HTTP clients, including browsers and mobile devices. For Web API to be successful HTTP based service, it needed strong web infrastructure like hosting, caching, concurrency, logging, security etc. One of the best web infrastructure was none other than ASP.NET. ASP.NET either in form Web Form or MVC was widely adopted, so the solid base for web infrastructure was mature enough to be extended as Web API. Microsoft responded to community needs by creating ASP.NET Web API- a super-simple yet very powerful framework for building HTTP-only, JSON-by-default web services without all the fuss of WCF. ASP.NET Web API can be used to build REST based services in matter of minutes and can easily consumed with any front end technologies. It used IIS (mostly) for hosting, caching, concurrency etc. features, it became quite popular. It was launched in 2012 with most basics needs for HTTP based services like convention-based Routing, HTTP Request and Response messages. Later Microsoft released much bigger and better ASP.NET Web API 2 along with ASP.NETMVC 5 in Visual Studio 2013. ASP.NET Web API 2 evolved at much faster pace with these features. Installed via NuGet Installing of Web API 2 was made simpler by using NuGet, either create empty ASP.NET or MVC project and then run command in NuGet Package Manager Console: Install-Package Microsoft.AspNet.WebApi Attribute Routing Initial release of Web API was based on convention-based routing meaning we define one or more route templates and work around it. It's simple without much fuss as routing logic in a single place & it's applied across all controllers. The real world applications are more complicated with resources (controllers/ actions) have child resources like customers having orders, books having authors etc. In such cases convention-based routing is not scalable. Web API 2 introduced a new concept of Attribute Routing which uses attributes in programming languages to define routes. One straight forward advantage is developer has full controls how URIs for Web API are formed. Here is quick snippet of Attribute Routing: [Route("customers/{customerId}/orders")] public IEnumerable<Order>GetOrdersByCustomer(intcustomerId) { ... } For more understanding on this, read Attribute Routing in ASP.NET Web API 2(https://www.asp.net/web-api/overview/web-api-routing-and-actions/attribute-routing-in-web-api-2) OWIN self-host ASP.NET Web API lives on ASP.NET framework, leading to think that it can be hosted on IIS only. The Web API 2 came new hosting package. Microsoft.AspNet.WebApi.OwinSelfHost With this package it can self-hosted outside IIS using OWIN/Katana. 
CORS (Cross Origin Resource Sharing) Any Web API developed either using .NET or non .NET technologies and meant to be used across different web frameworks, then enabling CORS is must. A must read on CORS&ASP.NET Web API 2 (https://www.asp.net/web-api/overview/security/enabling-cross-origin-requests-in-web-api). IHTTPActionResult and Web API OData improvements are other few notable features which helped evolve Web API 2 as strong technology for developing HTTP based services. ASP.NET Web API 2 has becoming more powerful over the years with C# language improvements like Asynchronous programming using Async/ Await, LINQ, Entity Framework Integration, Dependency Injection with DI frameworks, and so on. ASP.NET into Open Source world Every technology has to evolve with growing needs and advancements in hardware, network and software industry, ASP.NET Web API is no exception to that. Some of the evolution that ASP.NET Web API should undergo from perspectives of developer community, enterprises and end users are: ASP.NETMVC and Web API even though part of ASP.NET stack but their implementation and code base is different. A unified code base reduces burden of maintaining them. It's known that Web API's are consumed by various clients like web applications, Native apps, and Hybrid apps, desktop applications using different technologies (.NET or non .NET). But how about developing Web API in cross platform way, need not rely always on Windows OS/ Visual Studio IDE. Open sourcing the ASP.NET stack so that it's adopted on much bigger scale. End users are benefitted with open source innovations. We saw that why Web APIs were incepted, how they evolved into powerful HTTP based service and some evolutions required. With these thoughts Microsoft made an entry into world of Open Source by launching .NET Core and ASP.NET Core 1.0. What is .NET Core? .NET Core is a cross-platform free and open-source managed software framework similar to .NET Framework. It consists of CoreCLR, a complete cross-platform runtime implementation of CLR. .NET Core 1.0 was released on 27 June 2016 along with Visual Studio 2015 Update 3, which enables .NET Core development. In much simpler terms .NET Core applications can be developed, tested, deployed on cross platforms such as Windows, Linux flavours, macOS systems. With help of .NET Core, we don't really need Windows OS and in particular Visual Studio IDE to develop ASP.NET web applications, command-line apps, libraries, and UWP apps. In short, let's understand .NET Core components: CoreCLR:It is a virtual machine that manages the execution of .NET programs. CoreCLRmeans Core Common Language Runtime, it includes the garbage collector, JIT compiler, base .NET data types and many low-level classes. CoreFX: .NET Core foundational libraries likes class for collections, file systems, console, XML, Async and many others. CoreRT: .NET Core runtime optimized for AOT (ahead of time compilation) scenarios, with the accompanying .NET Native compiler toolchain. Its main responsibility is to do native compilation of code written in any of our favorite .NET programming language. .NET Core shares subset of original .NET framework, plus it comes with its own set of APIs that is not part of .NET framework. This results in some shared APIs that can be used by both .NET core and .NET framework. A .Net Core application can easily work on existing .NET Framework but not vice versa. 
.NET Core provides a CLI (Command Line Interface) for an execution entry point for operating systems and provides developer services like compilation and package management. The following are the .NET Core interesting points to know: .NET Core can be installed on cross platforms like Windows, Linux, andmacOS. It can be used in device, cloud, and embedded/IoT scenarios.  Visual Studio IDE is not mandatory to work with .NET Core, but when working on Windows OS we can leverage existing IDE knowledge to work.  .NET Core is modular, meaning that instead of assemblies, developers deal with NuGet packages.  .NET Core relies on its package manager to receive updates because cross platform technology can't rely on Windows Updates. To learn .NET Core, we just need a shell, text editor and its runtime installed. .NET Core comes with flexible deployment. It can be included in your app or installed side-by-side user- or machine-wide.  .NET Core apps can also be self-hosted/run as standalone apps. .NET Core supports four cross-platform scenarios--ASP.NET Core web apps, command-line apps, libraries, and Universal Windows Platform apps. It does not implement Windows Forms or WPF which render the standard GUI for desktop software on Windows. At present only C# programming language can be used to write .NET Core apps. F# and VB support are on the way. We will primarily focus on ASP.NET Core web apps which includes MVC and Web API. CLI apps, libraries will be covered briefly. What is ASP.NET Core? A new open-source and cross-platform framework for building modern cloud-based web applications using .NET. ASP.NET Core is completely open-source, you can download it from GitHub. It's cross platform meaning you can develop ASP.NET Core apps on Linux/macOS and of course on Windows OS. ASP.NET was first released almost 15 years back with .NET framework. Since then it's adopted by millions of developers for large, small applications. ASP.NET has evolved with many capabilities. With .NET Core as cross platform, ASP.NET took a huge leap beyond boundaries of Windows OS environment for development and deployment of web applications. ASP.NET Core overview                                                ASP.NET Core Architecture overview ASP.NET Core high level overview provides following insights: ASP.NET Core runs both on Full .NET framework and .NET Core.  ASP.NET Core applications with full .NET framework can only be developed and deployed only Windows OS/Server.  When using .NET core, it can be developed and deployed on platform of choice. The logos of Windows, Linux, macOSindicates that you can work with ASP.NET Core.  ASP.NET Core when on non-Windows machine, use the .NET Core libraries to run the applications. It's obvious you won't have all full .NET libraries but most of them are available.  Developers working on ASP.NET Core can easily switch working on any machine not confined to Visual Studio 2015 IDE. ASP.NET Core can run with different version of .NET Core. ASP.NET Core has much more foundational improvements apart from being cross-platform, we gain following advantages of using ASP.NET Core: Totally Modular: ASP.NET Core takes totally modular approach for application development, every component needed to build application are well factored into NuGet packages. Only add required packages through NuGet to keep overall application lightweight.  ASP.NET Core is no longer based on System.Web.dll. 
Choose your editors and tools: Visual Studio IDE was used to develop ASP.NET applications on Windows OS box, now since we have moved beyond the Windows world. Then we will require IDE/editors/ Tools required for developingASP.NET applications on Linux/macOS. Microsoft developed powerful lightweight code editors for almost any type of web applications called as Visual Studio Code.  ASP.NET Core is such a framework that we don't need Visual Studio IDE/ code to develop applications. We can use code editors like Sublime, Vim also. To work with C# code in editors, installed and use OmniSharp plugin.  OmniSharp is a set of tooling, editor integrations and libraries that together create an ecosystem that allows you to have a great programming experience no matter what your editor and operating system of choice may be.  Integration with modern web frameworks: ASP.NET Core has powerful, seamless integration with modern web frameworks like Angular, Ember, NodeJS, and Bootstrap.  Using bower andNPM, we can work with modern web frameworks.  Cloud ready: ASP.NET Core apps are cloud ready with configuration system, it just seamlessly gets transitioned from on-premises to cloud.  Built in Dependency Injection. Can be hosted on IIS or self-host in your own process or on nginx.  New light-weight and modular HTTP request pipeline. Unified code base for Web UI and Web APIs. We will see more on this when we explore anatomy of ASP.NET Core application. Summary So in this article we covered MVC framework and introduced .NET Core and its architecture. Resources for Article:   Further resources on this subject: [article] [article] [article]

article-image-logistic-regression-using-tensorflow
Packt
06 Mar 2018
9 min read
Save for later

Logistic Regression Using TensorFlow

Packt
06 Mar 2018
9 min read
In this article, by PKS Prakash and Achyutuni Sri Krishna Rao, authors of R Deep Learning Cookbook, we will learn how to perform logistic regression using TensorFlow. In this recipe, we will cover the application of TensorFlow in setting up a logistic regression model. The example will use a similar dataset to that used in the H2O model setup. (For more resources related to this topic, see here.)

What is TensorFlow
TensorFlow is another open source library developed by the Google Brain Team to build numerical computation models using data flow graphs. The core of TensorFlow was developed in C++ with the wrapper in Python. The tensorflow package in R gives you access to the TensorFlow API composed of Python modules to execute computation models. TensorFlow supports both CPU- and GPU-based computations. The tensorflow package in R calls the Python tensorflow API for execution, so it is essential to install the tensorflow package in both R and Python to make R work. The following are the dependencies for tensorflow:
Python 2.7 / 3.x
R (>3.2)
devtools package in R for installing TensorFlow from GitHub
TensorFlow in Python
pip

Getting ready
The code for this section was created on Linux but can be run on any operating system. To start modeling, load the tensorflow package in the environment. R loads the default TensorFlow environment variable and also the NumPy library from Python in the np variable:

library("tensorflow") # Load TensorFlow
np <- import("numpy") # Load numpy library

How to do it...
The data is imported using a standard function from R, as shown in the following code. The data is imported using read.csv and transformed into matrix format, followed by selecting the features used for modeling as defined in xFeatures and yFeatures. The next step in TensorFlow is to set up a graph to run optimization:

# Loading input and test data
xFeatures = c("Temperature", "Humidity", "Light", "CO2", "HumidityRatio")
yFeatures = "Occupancy"
occupancy_train <- as.matrix(read.csv("datatraining.txt", stringsAsFactors = T))
occupancy_test <- as.matrix(read.csv("datatest.txt", stringsAsFactors = T))

# subset features for modeling and transform to numeric values
occupancy_train <- apply(occupancy_train[, c(xFeatures, yFeatures)], 2, FUN=as.numeric)
occupancy_test <- apply(occupancy_test[, c(xFeatures, yFeatures)], 2, FUN=as.numeric)

# Data dimensions
nFeatures <- length(xFeatures)
nRow <- nrow(occupancy_train)

Before setting up the graph, let's reset the graph using the following command:

# Reset the graph
tf$reset_default_graph()

Additionally, let's start an interactive session, as it will allow us to execute variables without referring to a session object:

# Starting session as interactive session
sess <- tf$InteractiveSession()

Define the logistic regression model in TensorFlow:

# Setting-up Logistic regression graph
x <- tf$constant(unlist(occupancy_train[, xFeatures]), shape=c(nRow, nFeatures), dtype=np$float32)
W <- tf$Variable(tf$random_uniform(shape(nFeatures, 1L)))
b <- tf$Variable(tf$zeros(shape(1L)))
y <- tf$matmul(x, W) + b

The input feature x is defined as a constant, as it will be an input to the system. The weight W and bias b are defined as variables that will be optimized during the optimization process. The y is set up as a symbolic representation between x, W, and b. The weight W is initialized with a random uniform distribution and b is assigned the value zero.
The next step is to set up the cost function for logistic regression:

# Setting-up cost function and optimizer
y_ <- tf$constant(unlist(occupancy_train[, yFeatures]), dtype="float32", shape=c(nRow, 1L))
cross_entropy <- tf$reduce_mean(tf$nn$sigmoid_cross_entropy_with_logits(labels=y_, logits=y, name="cross_entropy"))
optimizer <- tf$train$GradientDescentOptimizer(0.15)$minimize(cross_entropy)

# Start a session
init <- tf$global_variables_initializer()
sess$run(init)

Execute the gradient descent algorithm for the optimization of weights, using cross entropy as the loss function:

# Running optimization
for (step in 1:5000) {
  sess$run(optimizer)
  if (step %% 20 == 0)
    cat(step, "-", sess$run(W), sess$run(b), "==>", sess$run(cross_entropy), "\n")
}

How it works...
The performance of the model can be evaluated using AUC:

# Performance on Train
library(pROC)
ypred <- sess$run(tf$nn$sigmoid(tf$matmul(x, W) + b))
roc_obj <- roc(occupancy_train[, yFeatures], as.numeric(ypred))

# Performance on test
nRowt <- nrow(occupancy_test)
xt <- tf$constant(unlist(occupancy_test[, xFeatures]), shape=c(nRowt, nFeatures), dtype=np$float32)
ypredt <- sess$run(tf$nn$sigmoid(tf$matmul(xt, W) + b))
roc_objt <- roc(occupancy_test[, yFeatures], as.numeric(ypredt))

AUC can be visualized using the plot.roc function from the pROC package. The performance for training and testing (holdout) is very similar.

plot.roc(roc_obj, col = "green", lty=2, lwd=2)
plot.roc(roc_objt, add=T, col="red", lty=4, lwd=2)

Performance of logistic regression using TensorFlow

Visualizing TensorFlow graphs
TensorFlow graphs can be visualized using TensorBoard. It is a service that utilizes TensorFlow event files to visualize TensorFlow models as graphs. Graph model visualization in TensorBoard is also used to debug TensorFlow models.

Getting ready
TensorBoard can be started using the following command in the terminal:

$ tensorboard --logdir home/log --port 6006

The following are the major parameters for TensorBoard:
--logdir: To map to the directory to load TensorFlow events
--debug: To increase log verbosity
--host: To define the host to listen to; localhost (127.0.0.1) by default
--port: To define the port on which TensorBoard will serve

The preceding command will launch the TensorBoard service on localhost at port 6006. The tabs in TensorBoard capture relevant data generated during graph execution.

How to do it...
This section covers how to visualize TensorFlow models and output in TensorBoard. To visualize summaries and graphs, data from TensorFlow can be exported using the FileWriter command from the summary module.
A default session graph can be added using the following command:

# Create Writer Obj for log
log_writer = tf$summary$FileWriter('c:/log', sess$graph)

The graph for the logistic regression developed using the preceding code is shown in the following figure:

Visualization of the logistic regression graph in TensorBoard

Similarly, other variable summaries can be added to TensorBoard using summary operations, as shown in the following code:

# Adding histogram summary to weight and bias variable
w_hist = tf$histogram_summary("weights", W)
b_hist = tf$histogram_summary("biases", b)

Create a cross entropy evaluation for test. An example script to generate the cross entropy cost function for test and train is shown in the following command:

# Set-up cross entropy for test
nRowt <- nrow(occupancy_test)
xt <- tf$constant(unlist(occupancy_test[, xFeatures]), shape=c(nRowt, nFeatures), dtype=np$float32)
ypredt <- tf$nn$sigmoid(tf$matmul(xt, W) + b)
yt_ <- tf$constant(unlist(occupancy_test[, yFeatures]), dtype="float32", shape=c(nRowt, 1L))
cross_entropy_tst <- tf$reduce_mean(tf$nn$sigmoid_cross_entropy_with_logits(labels=yt_, logits=ypredt, name="cross_entropy_tst"))

Add summary variables to be collected:

# Add summary ops to collect data
w_hist = tf$summary$histogram("weights", W)
b_hist = tf$summary$histogram("biases", b)
crossEntropySummary <- tf$summary$scalar("costFunction", cross_entropy)
crossEntropyTstSummary <- tf$summary$scalar("costFunction_test", cross_entropy_tst)

Open the writer object, log_writer. It writes the default graph to the location c:/log:

# Create Writer Obj for log
log_writer = tf$summary$FileWriter('c:/log', sess$graph)

Run the optimization and collect the summaries:

for (step in 1:2500) {
  sess$run(optimizer)
  # Evaluate performance on training and test data after every 50 iterations
  if (step %% 50 == 0){
    ### Performance on Train
    ypred <- sess$run(tf$nn$sigmoid(tf$matmul(x, W) + b))
    roc_obj <- roc(occupancy_train[, yFeatures], as.numeric(ypred))
    ### Performance on Test
    ypredt <- sess$run(tf$nn$sigmoid(tf$matmul(xt, W) + b))
    roc_objt <- roc(occupancy_test[, yFeatures], as.numeric(ypredt))
    cat("train AUC: ", auc(roc_obj), " Test AUC: ", auc(roc_objt), "\n")
    # Save summary of Bias and weights
    log_writer$add_summary(sess$run(b_hist), global_step=step)
    log_writer$add_summary(sess$run(w_hist), global_step=step)
    log_writer$add_summary(sess$run(crossEntropySummary), global_step=step)
    log_writer$add_summary(sess$run(crossEntropyTstSummary), global_step=step)
  }
}

Collect all the summaries into a single tensor using the merge_all command from the summary module:

summary = tf$summary$merge_all()

Write the summaries to the log file using the log_writer object:

log_writer = tf$summary$FileWriter('c:/log', sess$graph)
summary_str = sess$run(summary)
log_writer$add_summary(summary_str, step)
log_writer$close()

Summary
In this article, we have learned how to perform logistic regression using TensorFlow and covered the application of TensorFlow in setting up a logistic regression model.
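Because the R package is only a wrapper over the Python API, the same logistic regression graph can be written directly in Python. The sketch below assumes TensorFlow 1.x (matching the API used in this recipe) and uses synthetic data in place of the occupancy dataset, so treat it as an illustration of the mapping between the R and Python calls rather than a drop-in replacement.

import numpy as np
import tensorflow as tf  # assumes TensorFlow 1.x

# Synthetic stand-in for the occupancy features and labels
X_train = np.random.rand(100, 5).astype(np.float32)
y_train = (X_train[:, 0] > 0.5).astype(np.float32).reshape(-1, 1)

x = tf.constant(X_train, dtype=tf.float32)
y_ = tf.constant(y_train, dtype=tf.float32)

W = tf.Variable(tf.random_uniform([5, 1]))
b = tf.Variable(tf.zeros([1]))
y = tf.matmul(x, W) + b

cross_entropy = tf.reduce_mean(
    tf.nn.sigmoid_cross_entropy_with_logits(labels=y_, logits=y))
optimizer = tf.train.GradientDescentOptimizer(0.15).minimize(cross_entropy)

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    for step in range(5000):
        sess.run(optimizer)
    print(sess.run(cross_entropy))  # final training loss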

article-image-implement-long-short-term-memory-lstm-tensorflow
Gebin George
06 Mar 2018
4 min read
Save for later

Implement Long-short Term Memory (LSTM) with TensorFlow

Gebin George
06 Mar 2018
4 min read
[box type="note" align="" class="" width=""]This article is an excerpt from the book, Deep Learning Essentials written by Wei Di, Anurag Bhardwaj, and Jianing Wei. This book will help you get started with the essentials of deep learning and neural network modeling.[/box] In today’s tutorial, we will look at an example of using LSTM in TensorFlow to perform sentiment classification. The input to LSTM will be a sentence or sequence of words. The output of LSTM will be a binary value indicating a positive sentiment with 1 and a negative sentiment with 0. We will use a many-to-one LSTM architecture for this problem since it maps multiple inputs onto a single output. Figure LSTM: Basic cell architecture shows this architecture in more detail. As shown here, the input takes a sequence of word tokens (in this case, a sequence of three words). Each word token is input at a new time step and is input to the hidden state for the corresponding time step. For example, the word Book is input at time step t and is fed to the hidden state ht: Sentiment analysis: To implement this model in TensorFlow, we need to first define a few variables as follows: batch_size = 4 lstm_units = 16 num_classes = 2 max_sequence_length = 4 embedding_dimension = 64 num_iterations = 1000 As shown previously, batch_size dictates how many sequences of tokens we can input in one batch for training. lstm_units represents the total number of LSTM cells in the network. max_sequence_length represents the maximum possible length of a given sequence. Once defined, we now proceed to initialize TensorFlow-specific data structures for input data as follows: import tensorflow as tf labels = tf.placeholder(tf.float32, [batch_size, num_classes]) raw_data = tf.placeholder(tf.int32, [batch_size, max_sequence_length]) Given we are working with word tokens, we would like to represent them using a good feature representation technique. Let us assume the word embedding representation takes a word token and projects it onto an embedding space of dimension, embedding_dimension. The two-dimensional input data containing raw word tokens is now transformed into a three-dimensional word tensor with the added dimension representing the word embedding. We also use pre-computed word embedding, stored in a word_vectors data structure. We initialize the data structures as follows: data = tf.Variable(tf.zeros([batch_size, max_sequence_length, embedding_dimension]),dtype=tf.float32) data = tf.nn.embedding_lookup(word_vectors,raw_data) Now that the input data is ready, we look at defining the LSTM model. As shown previously, we need to create lstm_units of a basic LSTM cell. Since we need to perform a classification at the end, we wrap the LSTM unit with a dropout wrapper. To perform a full temporal pass of the data on the defined network, we unroll the LSTM using a dynamic_rnn routine of TensorFlow. 
We also initialize a random weight matrix and a constant value of 0.1 as the bias vector, as follows:

weight = tf.Variable(tf.truncated_normal([lstm_units, num_classes]))
bias = tf.Variable(tf.constant(0.1, shape=[num_classes]))
lstm_cell = tf.contrib.rnn.BasicLSTMCell(lstm_units)
wrapped_lstm_cell = tf.contrib.rnn.DropoutWrapper(cell=lstm_cell, output_keep_prob=0.8)
output, state = tf.nn.dynamic_rnn(wrapped_lstm_cell, data, dtype=tf.float32)

Once the output is generated by the dynamically unrolled RNN, we transpose its shape, multiply it by the weight vector, and add a bias vector to it to compute the final prediction value:

output = tf.transpose(output, [1, 0, 2])
last = tf.gather(output, int(output.get_shape()[0]) - 1)
prediction = (tf.matmul(last, weight) + bias)
weight = tf.cast(weight, tf.float64)
last = tf.cast(last, tf.float64)
bias = tf.cast(bias, tf.float64)

Since the initial prediction needs to be refined, we define an objective function with cross-entropy to minimize the loss as follows:

loss = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(logits=prediction, labels=labels))
optimizer = tf.train.AdamOptimizer().minimize(loss)

After this sequence of steps, we have a trained, end-to-end LSTM network for sentiment classification of arbitrary-length sentences. To summarize, we saw how effectively we can implement an LSTM network using TensorFlow. If you are interested to know more, check out the book Deep Learning Essentials, which will help you take your first steps in training efficient deep learning models and applying them in various practical scenarios.
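The excerpt stops at the graph definition, so for completeness here is a hedged sketch of the training loop that would drive it. The get_next_batch() helper and the word_vectors matrix are assumptions standing in for whatever batching and embedding-loading code a real project would provide.

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    for i in range(num_iterations):
        # get_next_batch() is a hypothetical helper returning token ids and one-hot labels
        batch_data, batch_labels = get_next_batch(batch_size)
        _, batch_loss = sess.run([optimizer, loss],
                                 feed_dict={raw_data: batch_data, labels: batch_labels})
        if i % 100 == 0:
            print("iteration {}: loss {:.4f}".format(i, batch_loss))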
article-image-gather-intel-and-plan-attack-strategies
Packt
06 Mar 2018
2 min read
Save for later

Gather Intel and Plan Attack Strategies

Packt
06 Mar 2018
2 min read
In this article by Himanshu Sharma, author of the Kali Linux - An Ethical Hacker's Cookbook, we will cover the following recipes:
Getting a list of subdomains
Shodan Honeyscore
Shodan plugins
Using Nmap to find open ports
(For more resources related to this topic, see here.)

In this article, we'll dive a little deeper and look at other tools available for gathering intel on our target. We'll start by using some of the well-known tools of Kali Linux, such as Fierce. Gathering information is a very crucial stage of performing a penetration test, as every step we take after this will be an outcome of all the information we gather during this stage. So it is very important that we gather as much information as possible before jumping into the exploitation stage.

Getting a list of subdomains
We don't always have a situation where a client has defined a full, detailed scope of what needs to be pentested. So we will use the following recipes to gather as much information as we can to perform a pentest.

How to do it…
We will see how to get a list of subdomains in the following ways.

Fierce
We'll start by jumping into Kali's terminal and using the first and most widely used tool, Fierce. To launch Fierce, type fierce -h to see the help menu. To perform a subdomain scan, we use this command:

fierce -dns host.com -threads 10

Dnsdumpster
Dnsdumpster is a free project by HackerTarget to look up subdomains. It relies on https://scans.io/ for its results. It is pretty simple to use: we type the domain name we want the subdomains for and it will show us the results.

Using Shodan for fun and profit
Shodan is the world's first search engine for devices connected to the Internet. It was launched in 2009 by John Matherly. Shodan can be used to look up webcams, databases, industrial systems, video games, and so on. Shodan mostly collects data on the most popular web services, such as HTTP, HTTPS, MongoDB, and FTP.

Getting ready
To use Shodan, we will need to create an account.

How to do it...
Open your browser and visit https://www.shodan.io. We begin by performing a simple search for FTP services running. To do this, we can use the following Shodan dork:

port:"21"

This search can be made more specific by specifying a particular country, organization, and so on:

port:21 country:"IN"

We can now see all the FTP servers running in India. We can also see the servers that allow anonymous login and the version of the FTP server they are running. Next, we'll try the organization filter by typing the following:

port:21 country:"IN" org:"BSNL"

Shodan has other tags as well, which can be used to perform advanced searches:
net: To scan IP ranges
city: To filter by city
More details can be found at https://www.shodan.io/explore.

Shodan Honeyscore
Shodan Honeyscore is another great project, built in Python. It helps us figure out whether an IP address we have is a honeypot or a real system.

How to do it...
To use Shodan Honeyscore, visit https://honeyscore.shodan.io/. Enter the IP address you want to check, and that's it!

Shodan plugins
To make our lives even easier, Shodan has plugins for Chrome and Firefox that can be used to check for open ports on websites we visit, on the go!

How to do it...
Download and install the plugin from https://www.shodan.io/. Browse any website, and you will see that by clicking on the plugin, you can see the open ports.

Using Nmap to find open ports
Nmap, or Network Mapper, is a security scanner written by Gordon Lyon.
It is used to find hosts and services in a network. It first came out in September 1997. Nmap has various features as well as scripts to perform various tests, such as finding the OS and service versions, and it can even be used to brute force default logins. Some of the most common types of scan are as follows:
TCP connect() scan
SYN stealth scan
UDP scan
Ping scan
Idle scan

How to do it...
Nmap comes pre-installed in Kali Linux. We can type the following command to start it and see all the options available:

nmap -h

To perform a basic scan, use the following command:

nmap -sV -Pn x.x.x.x

Here, -Pn implies that we do not check whether the host is up by performing a ping request first. The -sV parameter lists all the services running on the open ports that are found. Another flag we can use is -A, which automatically performs OS detection, version detection, script scanning, and traceroute. The command is as follows:

nmap -A -Pn x.x.x.x

To scan an IP range or multiple IPs, we can use this command:

nmap -A -Pn x.x.x.0/24

Using scripts
NSE, or the Nmap Scripting Engine, allows users to create their own scripts to perform different tasks automatically. These scripts are executed alongside a scan when it is run. They can be used to perform more effective version detection, exploitation of a vulnerability, and so on. The command for using a script is this:

nmap -Pn -sV host.com --script dns-brute

Here, the dns-brute script tries to fetch available subdomains by brute forcing them against a set of common subdomain names.

See also
More information on the scripts can be found in the official NSE documentation at https://nmap.org/nsedoc/

Summary
In this article, we learned how to get a list of subdomains on the network. Then we learned how to tell whether a system is a honeypot by calculating its Shodan Honeyscore; Chrome and Firefox plugins allow you to check open ports from your browser itself. Finally, we looked at how to use Nmap to find open ports.

Resources for Article:
Further resources on this subject:
Wireless Attacks in Kali Linux [article]
Introduction to Penetration Testing and Kali Linux [article]
What is Kali Linux [article]
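The Shodan searches shown earlier can also be scripted with the official shodan Python package (pip install shodan). The sketch below is an illustration only: the API key is a placeholder, and the result fields used (total, matches, ip_str, port) follow the library's documented response format, so check them against the version you install.

import shodan

API_KEY = "YOUR_API_KEY"  # placeholder taken from your Shodan account page
api = shodan.Shodan(API_KEY)

# Same dork as above: FTP servers in India
results = api.search('port:21 country:"IN"')
print("Total results: {}".format(results["total"]))

for match in results["matches"][:10]:
    print("{}:{}".format(match["ip_str"], match["port"]))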

article-image-how-to-compute-interpolation-in-scipy
Pravin Dhandre
05 Mar 2018
8 min read
Save for later

How to Compute Interpolation in SciPy

Pravin Dhandre
05 Mar 2018
8 min read
[box type="note" align="" class="" width=""]This article is an excerpt from a book co-authored by L. Felipe Martins, Ruben Oliva Ramos and V Kishore Ayyadevara titled SciPy Recipes. This book provides numerous recipes in mastering common tasks related to SciPy and associated libraries such as NumPy, pandas, and matplotlib.[/box] In today’s tutorial, we will see how to compute and solve polynomial, univariate interpolations using SciPy with detailed process and instructions. In this recipe, we will look at how to compute data polynomial interpolation by applying some important methods which are discussed in detail in the coming How to do it... section. Getting ready We will need to follow some instructions and install the prerequisites. How to do it… Let's get started. In the following steps, we will explain how to compute a polynomial interpolation and the things we need to know: They require the following parameters: points: An ndarray of floats, shape (n, D) data point coordinates. It can be either an array of shape (n, D) or a tuple of ndim arrays. values: An ndarray of float or complex shape (n,) data values. xi: A 2D ndarray of float or tuple of 1D array, shape (M, D). Points at which to interpolate data. method: A {'linear', 'nearest', 'cubic'}—This is an optional method of interpolation. One of the nearest return value is at the data point closest to the point of interpolation. See NearestNDInterpolator for more details. linear tessellates the input point set to n-dimensional simplices, and interpolates linearly on each simplex. See LinearNDInterpolator for more details. cubic (1D): Returns the value determined from a cubic spline. cubic (2D): Returns the value determined from a piecewise cubic, continuously differentiable (C1), and approximately curvature-minimizing polynomial surface. See CloughTocher2DInterpolator for more details. fill_value: float; optional. It is the value used to fill in for requested points outside of the convex hull of the input points. If it is not provided, then the default is nan. This option has no effect on the nearest method. rescale: bool; optional. Rescale points to the unit cube before performing interpolation. This is useful if some of the input dimensions have non-commensurable units and differ by many orders of magnitude. How it works… One can see that the exact result is reproduced by all of the methods to some degree, but for this smooth function, the piecewise cubic interpolant gives the best results: import matplotlib.pyplot as plt import numpy as np methods = [None, 'none', 'nearest', 'bilinear', 'bicubic', 'spline16', 'spline36', 'hanning', 'hamming', 'hermite', 'kaiser', 'quadric', 'catrom', 'gaussian', 'bessel', 'mitchell', 'sinc', 'lanczos'] # Fixing random state for reproducibility np.random.seed(19680801) grid = np.random.rand(4, 4) fig, axes = plt.subplots(3, 6, figsize=(12, 6), subplot_kw={'xticks': [], 'yticks': []}) fig.subplots_adjust(hspace=0.3, wspace=0.05) for ax, interp_method in zip(axes.flat, methods): ax.imshow(grid, interpolation=interp_method, cmap='viridis') ax.set_title(interp_method) plt.show() This is the result of the execution: Univariate interpolation In the next section, we will look at how to solve univariate interpolation. Getting ready We will need to follow some instructions and install the prerequisites. 
How to do it…
The following table summarizes the different univariate interpolation modes coded in SciPy, together with the processes that we may use to resolve them.

Finding a cubic spline that interpolates a set of data
In this recipe, we will look at how to find a cubic spline that interpolates a set of data using the CubicSpline method.

Getting ready
We will need to follow some instructions and install the prerequisites.

How to do it…
We can use the CubicSpline function with the following parameters:
x: array_like, shape (n,). A 1D array containing values of the independent variable. The values must be real, finite, and in strictly increasing order.
y: array_like. An array containing values of the dependent variable. It can have an arbitrary number of dimensions, but the length along axis must match the length of x. The values must be finite.
axis: int; optional. The axis along which y is assumed to be varying, meaning that for x[i], the corresponding values are np.take(y, i, axis=axis). The default is 0.
bc_type: string or two-tuple; optional. Boundary condition type. Two additional equations, given by the boundary conditions, are required to determine all coefficients of the polynomials on each segment. Refer to https://docs.scipy.org/doc/scipy-0.19.1/reference/generated/scipy.interpolate.CubicSpline.html#r59 for details. If bc_type is a string, the specified condition will be applied at both ends of the spline. The available conditions are:
not-a-knot (default): The first and second segments at a curve end are the same polynomial. This is a good default when there is no information about boundary conditions.
periodic: The interpolated function is assumed to be periodic with period x[-1] - x[0]. The first and last values of y must be identical: y[0] == y[-1]. This boundary condition results in y'[0] == y'[-1] and y''[0] == y''[-1].
clamped: The first derivatives at the curve ends are zero. Assuming a 1D y, bc_type=((1, 0.0), (1, 0.0)) is the same condition.
natural: The second derivatives at the curve ends are zero. Assuming a 1D y, bc_type=((2, 0.0), (2, 0.0)) is the same condition.
If bc_type is a two-tuple, the first and second values will be applied at the curve's start and end respectively. Each tuple value can be one of the previously mentioned strings (except periodic) or a tuple (order, deriv_values), allowing us to specify arbitrary derivatives at the curve ends:
order: The derivative order; it is 1 or 2.
deriv_value: An array_like containing derivative values. The shape must be the same as y, excluding the axis dimension. For example, if y is 1D, then deriv_value must be a scalar. If y is 3D with shape (n0, n1, n2) and axis=2, then deriv_value must be 2D with the shape (n0, n1).
extrapolate: {bool, 'periodic', None}; optional. If bool, it determines whether to extrapolate to out-of-bounds points based on the first and last intervals, or to return NaNs. If 'periodic', periodic extrapolation is used. If None (default), extrapolate is set to 'periodic' for bc_type='periodic' and to True otherwise.

How it works…
We have the following example:

%pylab inline
from scipy.interpolate import CubicSpline
import matplotlib.pyplot as plt

x = np.arange(10)
y = np.sin(x)
cs = CubicSpline(x, y)
xs = np.arange(-0.5, 9.6, 0.1)

plt.figure(figsize=(6.5, 4))
plt.plot(x, y, 'o', label='data')
plt.plot(xs, np.sin(xs), label='true')
plt.plot(xs, cs(xs), label="S")
plt.plot(xs, cs(xs, 1), label="S'")
plt.plot(xs, cs(xs, 2), label="S''")
plt.plot(xs, cs(xs, 3), label="S'''")
plt.xlim(-0.5, 9.5)
plt.legend(loc='lower left', ncol=2)
plt.show()

We can see the result here.

We see the next example:

theta = 2 * np.pi * np.linspace(0, 1, 5)
y = np.c_[np.cos(theta), np.sin(theta)]
cs = CubicSpline(theta, y, bc_type='periodic')
print("ds/dx={:.1f} ds/dy={:.1f}".format(cs(0, 1)[0], cs(0, 1)[1]))
# prints: ds/dx=0.0 ds/dy=1.0

xs = 2 * np.pi * np.linspace(0, 1, 100)
plt.figure(figsize=(6.5, 4))
plt.plot(y[:, 0], y[:, 1], 'o', label='data')
plt.plot(np.cos(xs), np.sin(xs), label='true')
plt.plot(cs(xs)[:, 0], cs(xs)[:, 1], label='spline')
plt.axes().set_aspect('equal')
plt.legend(loc='center')
plt.show()

In the resulting plot, we can see the final result.

Defining a B-spline for a given set of control points
In the next section, we will look at how to define a B-spline given a set of control data.

Getting ready
We need to follow some instructions and install the prerequisites.

How to do it…
A univariate spline in the B-spline basis can be written as

S(x) = Σ_{j=0}^{n-1} c_j B_{j,k;t}(x)

where the B_{j,k;t} are B-spline basis functions of degree k with knots t.

How it works...
Here, we construct a spline through a set of sampled points and compare it with a spline built on a reduced set of knots:

from scipy import interpolate
import numpy as np
import matplotlib.pyplot as plt

# sampling
x = np.linspace(0, 10, 10)
y = np.sin(x)

# spline through all the sampled points
tck = interpolate.splrep(x, y)
x2 = np.linspace(0, 10, 200)
y2 = interpolate.splev(x2, tck)

# spline with all the middle points as knots (not working yet)
# knots = x[1:-1]  # it should be something like this
knots = np.array([x[1]])  # not working with the above line; just seeing what this line does
weights = np.concatenate(([1], np.ones(x.shape[0] - 2) * .01, [1]))
tck = interpolate.splrep(x, y, t=knots, w=weights)
x3 = np.linspace(0, 10, 200)
y3 = interpolate.splev(x2, tck)

# plot
plt.plot(x, y, 'go', x2, y2, 'b', x3, y3, 'r')
plt.show()

Note that outside of the base interval, results differ. This is because BSpline extrapolates the first and last polynomial pieces of the B-spline functions active on the base interval. This is the result of solving the problem.

We successfully performed numerical computations and found interpolating functions using the polynomial and univariate interpolation routines coded in SciPy. If you found this tutorial useful, do check out the book SciPy Recipes to get quick recipes for performing other mathematical operations like differential equations, K-means and Discrete Fourier Transform.
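To connect the griddata parameter list at the start of this recipe to working code, here is a small self-contained example that interpolates scattered samples of a smooth function with each of the three methods; the test function and sample counts are arbitrary choices for illustration.

import numpy as np
from scipy.interpolate import griddata

def f(x, y):
    return np.sin(x) * np.cos(y)

# Scattered data point coordinates (n, D) and their values (n,)
rng = np.random.RandomState(0)
points = rng.rand(200, 2) * 4.0
values = f(points[:, 0], points[:, 1])

# Points at which to interpolate
grid_x, grid_y = np.mgrid[0:4:50j, 0:4:50j]

for method in ("nearest", "linear", "cubic"):
    zi = griddata(points, values, (grid_x, grid_y), method=method)
    # nanmax because linear/cubic return NaN outside the convex hull
    err = np.nanmax(np.abs(zi - f(grid_x, grid_y)))
    print(method, "max abs error:", err)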