How-To Tutorials

Tangled Web? Not At All!

Packt
22 Jun 2017
20 min read
In this article by Clif Flynt, the author of the book Linux Shell Scripting Cookbook - Third Edition, we look at a collection of shell-scripting recipes that talk to services on the Internet. This article is intended to help readers understand how to interact with the Web using shell scripts to automate tasks such as collecting and parsing data from web pages. This is discussed using POST and GET requests to web pages, and writing clients for web services. (For more resources related to this topic, see here.)

In this article, we will cover the following recipes:

Downloading a web page as plain text
Parsing data from a website
Image crawler and downloader
Web photo album generator
Twitter command-line client
Tracking changes to a website
Posting to a web page and reading the response
Downloading a video from the Internet

The Web has become the face of technology and the central access point for data processing. The primary interface to the web is a browser that's designed for interactive use. That's great for searching and reading articles on the web, but you can also do a lot to automate your interactions with shell scripts. For instance, instead of checking a website daily to see if your favorite blogger has added a new blog, you can automate the check and be informed when there's new information. Similarly, Twitter is the current hot technology for getting up-to-the-minute information. But if I subscribe to my local newspaper's Twitter account because I want the local news, Twitter will send me all news, including high-school sports that I don't care about. With a shell script, I can grab the tweets and customize my filters to match my desires, not rely on their filters.

Downloading a web page as plain text

Web pages are simply text with HTML tags, JavaScript, and CSS. The HTML tags define the content of a web page, which we can parse for specific content, and Bash scripts can do that parsing. An HTML file can be viewed in a web browser to see it properly formatted. Parsing a text document is simpler than parsing HTML data because we aren't required to strip off the HTML tags. Lynx is a command-line web browser that can download a web page as plain text.

Getting ready

Lynx is not installed in all distributions, but is available via the package manager.

# yum install lynx

or

apt-get install lynx

How to do it...

Let's download the webpage view, in ASCII character representation, into a text file by using the -dump flag with the lynx command:

$ lynx URL -dump > webpage_as_text.txt

This command lists all the hyperlinks (<a href="link">) separately under a heading References, as the footer of the text output. This lets us parse links separately with regular expressions. For example:

$ lynx -dump http://google.com > plain_text_page.txt

You can view the plain-text version by using the cat command:

$ cat plain_text_page.txt
Search [1]Images [2]Maps [3]Play [4]YouTube [5]News [6]Gmail [7]Drive [8]More »
[9]Web History | [10]Settings | [11]Sign in
[12]St. Patrick's Day 2017
_______________________________________________________
Google Search I'm Feeling Lucky [13]Advanced search [14]Language tools
[15]Advertising Programs [16]Business Solutions [17]+Google [18]About Google
© 2017 - [19]Privacy - [20]Terms
References
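Since every hyperlink ends up under the References heading, a short pipeline can pull the URLs back out of the dump. This is a minimal sketch rather than part of the recipe, and it assumes the dump ends with a References block as shown above:

$ lynx -dump http://google.com | sed -n '/^References/,$p' | grep -Eo 'https?://[^ ]+' > links.txt
$ wc -l links.txt    # how many links did the page contain?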
Parsing data from a website

The lynx, sed, and awk commands can be used to mine data from websites.

How to do it...

Let's go through the commands used to parse details of actresses from the website:

$ lynx -dump -nolist http://www.johntorres.net/BoxOfficefemaleList.html | \
  grep -o "Rank-.*" | \
  sed -e 's/ *Rank-\([0-9]*\) *\(.*\)/\1\t\2/' | \
  sort -nk 1 > actresslist.txt

The output is:

# Only 3 entries shown. All others omitted due to space limits
1 Keira Knightley
2 Natalie Portman
3 Monica Bellucci

How it works...

Lynx is a command-line web browser—it can dump a text version of a website as we would see in a web browser, instead of returning the raw HTML as wget or cURL do. This saves the step of removing HTML tags. The -nolist option shows the links without numbers. Parsing and formatting the lines that contain Rank is done with sed:

sed -e 's/ *Rank-\([0-9]*\) *\(.*\)/\1\t\2/'

These lines are then sorted according to the ranks.

See also

The Downloading a web page as plain text recipe in this article explains the lynx command.

Image crawler and downloader

Image crawlers download all the images that appear in a web page. Instead of going through the HTML page by hand to pick the images, we can use a script to identify the images and download them automatically.

How to do it...

This Bash script will identify and download the images from a web page:

#!/bin/bash
#Desc: Images downloader
#Filename: img_downloader.sh

if [ $# -ne 3 ]; then
  echo "Usage: $0 URL -d DIRECTORY"
  exit -1
fi

while [ -n "$1" ]
do
  case $1 in
  -d) shift; directory=$1; shift ;;
   *) url=${url:-$1}; shift;;
  esac
done

mkdir -p $directory;
baseurl=$(echo $url | egrep -o "https?://[a-z.-]+")

echo Downloading $url
curl -s $url | egrep -o "<img src=[^>]*>" | \
  sed 's/<img src="\([^"]*\).*/\1/g' | \
  sed "s,^/,$baseurl/," > /tmp/$$.list

cd $directory;

while read filename;
do
  echo Downloading $filename
  curl -s -O "$filename" --silent
done < /tmp/$$.list

An example usage is:

$ ./img_downloader.sh http://www.flickr.com/search/?q=linux -d images

How it works...

The image downloader script reads an HTML page, strips out all tags except <img>, parses src="URL" from the <img> tags, and downloads them to the specified directory. This script accepts a web page URL and the destination directory as command-line arguments. The [ $# -ne 3 ] statement checks whether the total number of arguments to the script is three; otherwise it exits and prints a usage example. Otherwise, this code parses the URL and destination directory:

while [ -n "$1" ]
do
  case $1 in
  -d) shift; directory=$1; shift ;;
   *) url=${url:-$1}; shift;;
  esac
done

The while loop runs until all the arguments are processed. The shift command shifts arguments to the left so that $1 will take the next argument's value; that is, $2, and so on. Hence, we can evaluate all arguments through $1 itself. The case statement checks the first argument ($1). If that matches -d, the next argument must be a directory name, so the arguments are shifted and the directory name is saved. If the argument is any other string it is a URL. The advantage of parsing arguments in this way is that we can place the -d argument anywhere in the command line:

$ ./img_downloader.sh -d DIR URL

Or:

$ ./img_downloader.sh URL -d DIR

The egrep -o "<img src=[^>]*>" command will print only the matching strings, which are the <img> tags including their attributes. The phrase [^>]* matches all the characters except the closing >, that is, <img src="image.jpg">. sed 's/<img src="\([^"]*\).*/\1/g' extracts the URL from the string src="url". There are two types of image source paths—relative and absolute. Absolute paths contain full URLs that start with http:// or https://.
Relative URLs start with / or with the image name itself. An example of an absolute URL is http://example.com/image.jpg. An example of a relative URL is /image.jpg. For relative URLs, the starting / should be replaced with the base URL to transform it to http://example.com/image.jpg. The script initializes the baseurl by extracting it from the initial url with the command:

baseurl=$(echo $url | egrep -o "https?://[a-z.-]+")

The output of the previously described sed command is piped into another sed command to replace a leading / with the baseurl, and the results are saved in a file named for the script's PID: /tmp/$$.list.

sed "s,^/,$baseurl/," > /tmp/$$.list

The final while loop iterates through each line of the list and uses curl to download the images. The --silent argument is used with curl to avoid extra progress messages from being printed on the screen.
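When the goal is simply to grab every image linked from a single page, wget's recursive mode offers a shorter alternative to the script above. This is only a sketch, not part of the recipe: -A keeps just the listed extensions, -P sets the output directory, and -H (span hosts) would be needed if the images live on a different domain than the page itself.

$ wget -nd -r -l1 -A jpg,jpeg,png,gif -P images URL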
Web photo album generator

Web developers frequently create photo albums of full sized and thumbnail images. When a thumbnail is clicked, a large version of the picture is displayed. This requires resizing and placing many images. These actions can be automated with a simple bash script. The script creates thumbnails, places them in a thumbs directory, and generates the code fragment for the <img> tags automatically.

Getting ready

This script uses a for loop to iterate over every image in the current directory. The usual Bash utilities such as cat and convert (from the Image Magick package) are used. These will generate an HTML album, using all the images, in index.html.

How to do it...

This Bash script will generate an HTML album page:

#!/bin/bash
#Filename: generate_album.sh
#Description: Create a photo album using images in current directory

echo "Creating album.."
mkdir -p thumbs

cat <<EOF1 > index.html
<html>
<head>
<style>
body { width:470px; margin:auto; border: 1px dashed grey; padding:10px; }
img { margin:5px; border: 1px solid black; }
</style>
</head>
<body>
<center><h1> #Album title </h1></center>
<p>
EOF1

for img in *.jpg;
do
  convert "$img" -resize "100x" "thumbs/$img"
  echo "<a href=\"$img\">" >> index.html
  echo "<img src=\"thumbs/$img\" title=\"$img\" /></a>" >> index.html
done

cat <<EOF2 >> index.html
</p>
</body>
</html>
EOF2

echo Album generated to index.html

Run the script as follows:

$ ./generate_album.sh
Creating album..
Album generated to index.html

How it works...

The initial part of the script is used to write the header part of the HTML page. The following script redirects all the contents up to EOF1 to index.html:

cat <<EOF1 > index.html
contents...
EOF1

The header includes the HTML and CSS styling. for img in *.jpg; iterates over the file names and evaluates the body of the loop. convert "$img" -resize "100x" "thumbs/$img" creates images of 100 px width as thumbnails. The following statements generate the required <a> and <img> tags and append them to index.html:

echo "<a href=\"$img\">" >> index.html
echo "<img src=\"thumbs/$img\" title=\"$img\" /></a>" >> index.html

Finally, the footer HTML tags are appended with cat as done in the first part of the script.

Twitter command-line client

Twitter is the hottest micro-blogging platform, as well as the latest buzz of the online social media now. We can use the Twitter API to read tweets on our timeline from the command line! Let's see how to do it.

Getting ready

Recently, Twitter stopped allowing people to log in by using plain HTTP Authentication, so we must use OAuth to authenticate ourselves. Perform the following steps:

Download the bash-oauth library from https://github.com/livibetter/bash-oauth/archive/master.zip, and unzip it to any directory.
Go to that directory and then inside the subdirectory bash-oauth-master, run make install-all as root.
Go to https://apps.twitter.com/ and register a new app. This will make it possible to use OAuth.
After registering the new app, go to your app's settings and change Access type to Read and Write.
Now, go to the Details section of the app and note two things—Consumer Key and Consumer Secret, so that you can substitute these in the script we are going to write.

Great, now let's write the script that uses this.

How to do it...

This Bash script uses the OAuth library to read tweets or send your own updates.

#!/bin/bash
#Filename: twitter.sh
#Description: Basic twitter client

oauth_consumer_key=YOUR_CONSUMER_KEY
oauth_consumer_secret=YOUR_CONSUMER_SECRET
config_file=~/.$oauth_consumer_key-$oauth_consumer_secret-rc

if [[ "$1" != "read" ]] && [[ "$1" != "tweet" ]]; then
  echo -e "Usage: $0 tweet status_message\n OR\n $0 read\n"
  exit -1;
fi

#source /usr/local/bin/TwitterOAuth.sh
source bash-oauth-master/TwitterOAuth.sh
TO_init

if [ ! -e $config_file ]; then
  TO_access_token_helper
  if (( $? == 0 )); then
    echo oauth_token=${TO_ret[0]} > $config_file
    echo oauth_token_secret=${TO_ret[1]} >> $config_file
  fi
fi

source $config_file

if [[ "$1" = "read" ]]; then
  TO_statuses_home_timeline '' 'YOUR_TWEET_NAME' '10'
  echo $TO_ret | sed 's/,"/\n/g' | sed 's/":/~/' | \
    awk -F~ '{} {if ($1 == "text") {txt=$2;} else if ($1 == "screen_name") printf("From: %s\n Tweet: %s\n\n", $2, txt);} {}' | \
    tr '"' ' '
elif [[ "$1" = "tweet" ]]; then
  shift
  TO_statuses_update '' "$@"
  echo 'Tweeted :)'
fi

Run the script as follows:

$ ./twitter.sh read
Please go to the following link to get the PIN: https://api.twitter.com/oauth/authorize?oauth_token=LONG_TOKEN_STRING
PIN: PIN_FROM_WEBSITE
Now you can create, edit and present Slides offline. - by A Googler
$ ./twitter.sh tweet "I am reading Packt Shell Scripting Cookbook"
Tweeted :)
$ ./twitter.sh read | head -2
From: Clif Flynt
Tweet: I am reading Packt Shell Scripting Cookbook

How it works...

First of all, we use the source command to include the TwitterOAuth.sh library, so we can use its functions to access Twitter. The TO_init function initializes the library. Every app needs to get an OAuth token and token secret the first time it is used. If these are not present, we use the library function TO_access_token_helper to acquire them. Once we have the tokens, we save them to a config file so we can simply source it the next time the script is run. The library function TO_statuses_home_timeline fetches the tweets from Twitter.
This data is returned as a single long string in JSON format, which starts like this:

[{"created_at":"Thu Nov 10 14:45:20 +0000 2016","id":7...9,"id_str":"7...9","text":"Dining...

Each tweet starts with the created_at tag and includes a text and a screen_name tag. The script will extract the text and screen name data and display only those fields. The script assigns the long string to the variable TO_ret. The JSON format uses quoted strings for the key and may or may not quote the value. The key/value pairs are separated by commas, and the key and value are separated by a colon :. The first sed replaces each ," character set with a newline, making each key/value a separate line. These lines are piped to another sed command to replace each occurrence of ": with a tilde ~, which creates a line like

screen_name~"Clif_Flynt"

The final awk script reads each line. The -F~ option splits the line into fields at the tilde, so $1 is the key and $2 is the value. The if command checks for text or screen_name. The text is first in the tweet, but it's easier to read if we report the sender first, so the script saves a text return until it sees a screen_name, then prints the current value of $2 and the saved value of the text. The TO_statuses_update library function generates a tweet. The empty first parameter defines our message as being in the default format, and the message is a part of the second parameter.

Tracking changes to a website

Tracking website changes is useful to both web developers and users. Checking a website manually is impractical, but a change tracking script can be run at regular intervals. When a change occurs, it generates a notification.

Getting ready

Tracking changes in terms of Bash scripting means fetching websites at different times and taking the difference by using the diff command. We can use curl and diff to do this.

How to do it...

This bash script combines different commands, to track changes in a webpage:

#!/bin/bash
#Filename: track_changes.sh
#Desc: Script to track changes to webpage

if [ $# -ne 1 ]; then
  echo -e "Usage: $0 URL\n"
  exit 1;
fi

first_time=0
# Not first time
if [ ! -e "last.html" ]; then
  first_time=1
  # Set it is first time run
fi

curl --silent $1 -o recent.html

if [ $first_time -ne 1 ]; then
  changes=$(diff -u last.html recent.html)
  if [ -n "$changes" ]; then
    echo -e "Changes:\n"
    echo "$changes"
  else
    echo -e "\nWebsite has no changes"
  fi
else
  echo "[First run] Archiving.."
fi

cp recent.html last.html

Let's look at the output of the track_changes.sh script on a website you control. First we'll see the output when a web page is unchanged, and then after making changes. Note that you should change MyWebSite.org to your website name.

First, run the following command:

$ ./track_changes.sh http://www.MyWebSite.org
[First run] Archiving..

Second, run the command again.

$ ./track_changes.sh http://www.MyWebSite.org
Website has no changes

Third, run the following command after making changes to the web page:

$ ./track_changes.sh http://www.MyWebSite.org
Changes:

--- last.html 2010-08-01 07:29:15.000000000 +0200
+++ recent.html 2010-08-01 07:29:43.000000000 +0200
@@ -1,3 +1,4 @@
+added line :)
 data
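To get the regular checks the recipe calls for, the script can be scheduled with cron. The crontab entry below is a sketch rather than part of the recipe; the script path and the use of mail for notification are assumptions about your own setup:

$ crontab -e
# check the site once an hour and mail the report
0 * * * * $HOME/bin/track_changes.sh http://www.MyWebSite.org | mail -s "website change report" you@example.com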
How it works...

The script checks whether the script is running for the first time by using [ ! -e "last.html" ];. If last.html doesn't exist, it means that it is the first time and the webpage must be downloaded and saved as last.html. If it is not the first time, it downloads the new copy recent.html and checks the difference with the diff utility. Any changes will be displayed as diff output. Finally, recent.html is copied to last.html.

Note that changing the website you're checking will generate a huge diff file the first time you examine it. If you need to track multiple pages, you can create a folder for each website you intend to watch.

Posting to a web page and reading the response

POST and GET are two types of requests in HTTP to send information to, or retrieve information from, a website. In a GET request, we send parameters (name-value pairs) through the webpage URL itself. The POST command places the key/value pairs in the message body instead of the URL. POST is commonly used when submitting long forms or to conceal the information submitted from a casual glance.

Getting ready

For this recipe, we will use the sample guestbook website included in the tclhttpd package. You can download tclhttpd from http://sourceforge.net/projects/tclhttpd and then run it on your local system to create a local webserver. The guestbook page requests a name and URL which it adds to a guestbook to show who has visited a site when the user clicks the Add me to your guestbook button. This process can be automated with a single curl (or wget) command.

How to do it...

Download the tclhttpd package and cd to the bin folder. Start the tclhttpd daemon with this command:

tclsh httpd.tcl

The format to POST and read the HTML response from a generic website resembles this:

$ curl URL -d "postvar=postdata2&postvar2=postdata2"

Consider the following example:

$ curl http://127.0.0.1:8015/guestbook/newguest.html -d "name=Clif&url=www.noucorp.com&http=www.noucorp.com"

curl prints a response page like this:

<HTML>
<Head>
<title>Guestbook Registration Confirmed</title>
</Head>
<Body BGCOLOR=white TEXT=black>
<a href="www.noucorp.com">www.noucorp.com</a>
<DL>
<DT>Name
<DD>Clif
<DT>URL
<DD>
</DL>
www.noucorp.com
</Body>

-d is the argument used for posting. The string argument for -d is similar to the GET request semantics. var=value pairs are to be delimited by &. You can POST the data using wget by using --post-data "string". For example:

$ wget http://127.0.0.1:8015/guestbook/newguest.cgi --post-data "name=Clif&url=www.noucorp.com&http=www.noucorp.com" -O output.html

Use the same format as cURL for name-value pairs. The text in output.html is the same as that returned by the cURL command. The string to the post arguments (for example, to -d or --post-data) should always be given in quotes. If quotes are not used, & is interpreted by the shell to indicate that this should be a background process.

How it works...

If you look at the website source (use the View Source option from the web browser), you will see an HTML form defined, similar to the following code:

<form action="newguest.cgi" method="post">
<ul>
<li> Name: <input type="text" name="name" size="40">
<li> Url: <input type="text" name="url" size="40">
<input type="submit">
</ul>
</form>

Here, newguest.cgi is the target URL. When the user enters the details and clicks on the Submit button, the name and url inputs are sent to newguest.cgi as a POST request, and the response page is returned to the browser.
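One detail the recipe does not cover: if a form value contains spaces or other special characters, curl can encode it for you with --data-urlencode, which behaves like -d but URL-encodes the value. A small sketch against the same local guestbook server:

$ curl http://127.0.0.1:8015/guestbook/newguest.html --data-urlencode "name=Clif Flynt" --data-urlencode "url=www.noucorp.com"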
Downloading a video from the internet

There are many reasons for downloading a video. If you are on a metered service, you might want to download videos during off-hours when the rates are cheaper. You might want to watch videos where the bandwidth doesn't support streaming, or you might just want to make certain that you always have that video of cute cats to show your friends.

Getting ready

One program for downloading videos is youtube-dl. This is not included in most distributions and the repositories may not be up to date, so it's best to go to the youtube-dl main site: http://yt-dl.org

You'll find links and information on that page for downloading and installing youtube-dl.

How to do it…

Using youtube-dl is easy. Open your browser and find a video you like. Then copy/paste that URL to the youtube-dl command line.

youtube-dl https://www.youtube.com/watch?v=AJrsl3fHQ74

While youtube-dl is downloading the file it will generate a status line on your terminal.

How it works…

The youtube-dl program works by sending a GET message to the server, just as a browser would do. It masquerades as a browser so that YouTube or other video providers will deliver the video as if the device were streaming. The --list-formats (-F) option will list the formats in which a video is available, and the --format (-f) option will specify which format to download. This is useful if you want to download a higher-resolution video than your internet connection can reliably stream.

Summary

In this article we learned how to download and parse website data, send data to forms, and automate website-usage tasks and similar activities. We can automate many activities that we perform interactively through a browser with a few lines of scripting.

Resources for Article:

Further resources on this subject:

Linux Shell Scripting – various recipes to help you [article]
Linux Shell Script: Tips and Tricks [article]
Linux Shell Script: Monitoring Activities [article]

Setting up Intel Edison

Packt
21 Jun 2017
8 min read
In this article by Avirup Basu, the author of the book Intel Edison Projects, we will be covering the following topics: Setting up the Intel Edison Setting up the developer environment (For more resources related to this topic, see here.) In every Internet of Things(IoT) or robotics project, we have a controller that is the brain of the entire system. Similarly we have Intel Edison. The Intel Edison computing module comes in two different packages. One of which is a mini breakout board the other of which is an Arduino compatible board. One can use the board in its native state as well but in that case the person has to fabricate his/hers own expansion board. The Edison is basically a size of a SD card. Due to its tiny size, it's perfect for wearable devices. However it's capabilities makes it suitable for IoT application and above all, the powerful processing capability makes it suitable for robotics application. However we don't simply use the device in this state. We hook up the board with an expansion board. The expansion board provides the user with enough flexibility and compatibility for interfacing with other units. The Edison has an operating system that is running the entire system. It runs a Linux image. Thus, to setup your device, you initially need to configure your device both at the hardware and at software level. Initial hardware setup We'll concentrate on the Edison package that comes with an Arduino expansion board. Initially you will get two different pieces: The Intel® Edison board The Arduino expansion board The following given is the architecture of the device: Architecture of Intel Edison. Picture Credits: https://software.intel.com/en-us/ We need to hook these two pieces up in a single unit. Place the Edison board on top of the expansion board such that the GPIO interfaces meet at a single point. Gently push the Edison against the expansion board. You will get a click sound. Use the screws that comes with the package to tighten the set up. Once, this is done, we'll now setup the device both at hardware level and software level to be used further. Following are the steps we'll cover in details: Downloading necessary software packages Connecting your Intel® Edison to your PC Flashing your device with the Linux image Connecting to a Wi-Fi network SSH-ing your Intel® Edison device Downloading necessary software packages To move forward with the development on this platform, we need to download and install a couple of software which includes the drivers and the IDEs. Following is the list of the software along with the links that are required: Intel® Platform Flash Tool Lite (https://01.org/android-ia/downloads/intel-platform-flash-tool-lite) PuTTY (http://www.chiark.greenend.org.uk/~sgtatham/putty/download.html) Intel XDK for IoT (https://software.intel.com/en-us/intel-xdk) Arduino IDE (https://www.arduino.cc/en/Main/Software) FileZilla FTP client (https://filezilla-project.org/download.php) Notepad ++ or any other editor (https://notepad-plus-plus.org/download/v7.3.html) Drivers and miscellaneous downloads Latest Yocto* Poky image Windows standalone driver for Intel Edison FTDI drivers (http://www.ftdichip.com/Drivers/VCP.htm) The 1st and the 2nd packages can be downloaded from (https://software.intel.com/en-us/iot/hardware/edison/downloads) Plugging in your device After all the software and drivers installation, we'll now connect the device to a PC. You need two Micro-B USB Cables(s) to connect your device to the PC. 
You can also use a 9V power adapter and a single Micro-B USB Cable, but for now we will not use the power adapter: Different sections of Arduino expansion board of Intel Edison A small switch exists between the USB port and the OTG port. This switch must be towards the OTG port because we're going to power the device from the OTG port and not through the DC power port. Once it is connected to your PC, open your device manager and expands the ports section. If all installations of drivers were successful, then you must see two ports: Intel Edison virtual com port USB serial port Flashing your device Once your device is successfully detected an installed, you need to flash your device with the Linux image. For this we'll use the flash tool provided by Intel: Open the flash lite tool and connect your device to the PC: Intel phone flash lite tool Once the flash tool is opened, click on Browse... and browse to the .zip file of the Linux image you have downloaded. After you click on OK, the tool will automatically unzip the file. Next, click on Start to flash: Intel® Phone flash lite tool – stage 1 You will be asked to disconnect and reconnect your device. Do as the tool says and the board should start flashing. It may take some time before the flashing is completed. You are requested not to tamper with the device during the process. Once the flashing is completed, we'll now configure the device: Intel® Phone flash lite tool – complete Configuring the device After flashing is successfully we'll now configure the device. We're going to use the PuTTY console for the configuration. PuTTY is an SSH and telnet client, developed originally by Simon Tatham for the Windows platform. We're going to use the serial section here. Before opening PuTTY console: Open up the device manager and note the port number for USB serial port. This will be used in your PuTTY console: Ports for Intel® Edison in PuTTY Next select Serialon PuTTY console and enter the port number. Use a baud rate of 115200. Press Open to open the window for communicating with the device: PuTTY console – login screen Once you are in the console of PuTTY, then you can execute commands to configure your Edison. Following is the set of tasks we'll do in the console to configure the device: Provide your device a name Provide root password (SSH your device) Connect your device to Wi-Fi Initially when in the console, you will be asked to login. Type in root and press Enter. Once entered you will see root@edison which means that you are in the root directory: PuTTY console – login success Now, we are in the Linux Terminal of the device. Firstly, we'll enter the following command for setup: configure_edison –setup Press Enter after entering the command and the entire configuration will be somewhat straightforward: PuTTY console – set password Firstly, you will be asked to set a password. Type in a password and press Enter. You need to type in your password again for confirmation. Next, we'll set up a name for the device: PuTTY console – set name Give a name for your device. Please note that this is not the login name for your device. It's just an alias for your device. Also the name should be at-least 5 characters long. Once you entered the name, it will ask for confirmation press y to confirm. Then it will ask you to setup Wi-Fi. Again select y to continue. It's not mandatory to setup Wi-Fi, but it's recommended. 
We need the Wi-Fi for file transfer, downloading packages, and so on:

PuTTY console – set Wi-Fi

Once the scanning is completed, we'll get a list of available networks. Select the number corresponding to your network and press Enter. In this case it is 5, which corresponds to avirup171, which is my Wi-Fi network. Enter the network credentials. After you do that, your device will get connected to the Wi-Fi. You should get an IP after your device is connected:

PuTTY console – set Wi-Fi - 2

After a successful connection you should get this screen. Make sure your PC is connected to the same network. Open up the browser on your PC, and enter the IP address shown in the console. You should get a screen similar to this:

Wi-Fi setup – completed

Now we are done with the initial setup. However, Wi-Fi setup normally doesn't happen in one go. Sometimes your device doesn't get connected to the Wi-Fi, and sometimes you cannot reach the page shown before. In those cases you need to start wpa_cli to manually configure the Wi-Fi. Refer to the following link for the details: http://www.intel.com/content/www/us/en/support/boards-and-kits/000006202.html

Summary

In this article, we have covered the initial setup of the Intel Edison and connecting it to the network. We have also covered how to transfer files to the Edison and vice versa.

Resources for Article:

Further resources on this subject:

Getting Started with Intel Galileo [article]
Creating Basic Artificial Intelligence [article]
Using IntelliTrace to Diagnose Problems with a Hosted Service [article]

CORS in Node.js

Packt
20 Jun 2017
14 min read
In this article by Randall Goya, and Rajesh Gunasundaram the author of the book CORS Essentials, Node.js is a cross-platform JavaScript runtime environment that executes JavaScript code at server side. This enables to have a unified language across the web application development. JavaScript becomes the unified language that runs both on client side and server side. (For more resources related to this topic, see here.) In this article we will learn about: Node.js is a JavaScript platform for developing server-side web applications. Node.js can provide the web server for other frameworks including Express.js, AngularJS, Backbone,js, Ember.js and others. Some other JavaScript frameworks such as ReactJS, Ember.js and Socket.IO may also use Node.js as the web server. Isomorphic JavaScript can add server-side functionality for client-side frameworks. JavaScript frameworks are evolving rapidly. This article reviews some of the current techniques, and syntax specific for some frameworks. Make sure to check the documentation for the project to discover the latest techniques. Understanding CORS concepts, you may create your own solution, because JavaScript is a loosely structured language. All the examples are based on the fundamentals of CORS, with allowed origin(s), methods, and headers such as Content-Type, or preflight, that may be required according to the CORS specification. JavaScript frameworks are very popular JavaScript is sometimes called the lingua franca of the Internet, because it is cross-platform and supported by many devices. It is also a loosely-structured language, which makes it possible to craft solutions for many types of applications. Sometimes an entire application is built in JavaScript. Frequently JavaScript provides a client-side front-end for applications built with Symfony, Content Management Systems such as Drupal, and other back-end frameworks. Node.js is server-side JavaScript and provides a web server as an alternative to Apache, IIS, Nginx and other traditional web servers. Introduction to Node.js Node.js is an open-source and cross-platform library that enables in developing server-side web applications. Applications will be written using JavaScript in Node.js can run on many operating systems, including OS X, Microsoft Windows, Linux, and many others. Node.js provides a non-blocking I/O and an event-driven architecture designed to optimize an application's performance and scalability for real-time web applications. The biggest difference between PHP and Node.js is that PHP is a blocking language, where commands execute only after the previous command has completed, while Node.js is a non-blocking language where commands execute in parallel, and use callbacks to signal completion. Node.js can move files, payloads from services, and data asynchronously, without waiting for some command to complete, which improves performance. Most JS frameworks that work with Node.js use the concept of routes to manage pages and other parts of the application. Each route may have its own set of configurations. For example, CORS may be enabled only for a specific page or route. Node.js loads modules for extending functionality via the npm package manager. The developer selects which packages to load with npm, which reduces bloat. The developer community creates a large number of npm packages created for specific functions. JXcore is a fork of Node.js targeting mobile devices and IoTs (Internet of Things devices). 
JXcore can use both Google V8 and Mozilla SpiderMonkey as its JavaScript engine. JXcore can run Node applications on iOS devices using Mozilla SpiderMonkey. MEAN is a popular JavaScript software stack with MongoDB (a NoSQL database), Express.js and AngularJS, all of which run on a Node.js server. JavaScript frameworks that work with Node.js Node.js provides a server for other popular JS frameworks, including AngularJS, Express.js. Backbone.js, Socket.IO, and Connect.js. ReactJS was designed to run in the client browser, but it is often combined with a Node.js server. As we shall see in the following descriptions, these frameworks are not necessarily exclusive, and are often combined in applications. Express.js is a Node.js server framework Express.js is a Node.js web application server framework, designed for building single-page, multi-page, and hybrid web applications. It is considered the "standard" server framework for Node.js. The package is installed with the command npm install express –save. AngularJS extends static HTML with dynamic views HTML was designed for static content, not for dynamic views. AngularJS extends HTML syntax with custom tag attributes. It provides model–view–controller (MVC) and model–view–viewmodel (MVVM) architectures in a front-end client-side framework.  AngularJS is often combined with a Node.js server and other JS frameworks. AngularJS runs client-side and Express.js runs on the server, therefore Express.js is considered more secure for functions such as validating user input, which can be tampered client-side. AngularJS applications can use the Express.js framework to connect to databases, for example in the MEAN stack. Connect.js provides middleware for Node.js requests Connect.js is a JavaScript framework providing middleware to handle requests in Node.js applications. Connect.js provides middleware to handle Express.js and cookie sessions, to provide parsers for the HTML body and cookies, and to create vhosts (virtual hosts) and error handlers, and to override methods. Backbone.js often uses a Node.js server Backbone.js is a JavaScript framework with a RESTful JSON interface and is based on the model–view–presenter (MVP) application design. It is designed for developing single-page web applications, and for keeping various parts of web applications (for example, multiple clients and the server) synchronized. Backbone depends on Underscore.js, plus jQuery for use of all the available fetures. Backbone often uses a Node.js server, for example to connect to data storage. ReactJS handles user interfaces ReactJS is a JavaScript library for creating user interfaces while addressing challenges encountered in developing single-page applications where data changes over time. React handles the user interface in model–view–controller (MVC) architecture. ReactJS typically runs client-side and can be combined with AngularJS. Although ReactJS was designed to run client-side, it can also be used server-side in conjunction with Node.js. PayPal and Netflix leverage the server-side rendering of ReactJS known as Isomorphic ReactJS. There are React-based add-ons that take care of the server-side parts of a web application. Socket.IO uses WebSockets for realtime event-driven applications Socket.IO is a JavaScript library for event-driven web applications using the WebSocket protocol ,with realtime, bi-directional communication between web clients and servers. It has two parts: a client-side library that runs in the browser, and a server-side library for Node.js. 
Although it can be used as simply a wrapper for WebSocket, it provides many more features, including broadcasting to multiple sockets, storing data associated with each client, and asynchronous I/O. Socket.IO provides better security than WebSocket alone, since allowed domains must be specified for its server. Ember.js can use Node.js Ember is another popular JavaScript framework with routing that uses Moustache templates. It can run on a Node.js server, or also with Express.js. Ember can also be combined with Rack, a component of Ruby On Rails (ROR). Ember Data is a library for  modeling data in Ember.js applications. CORS in Express.js The following code adds the Access-Control-Allow-Origin and Access-Control-Allow-Headers headers globally to all requests on all routes in an Express.js application. A route is a path in the Express.js application, for example /user for a user page. app.all sets the configuration for all routes in the application. Specific HTTP requests such as GET or POST are handled by app.get and app.post. app.all('*', function(req, res, next) { res.header("Access-Control-Allow-Origin", "*"); res.header("Access-Control-Allow-Headers", "X-Requested-With"); next(); }); app.get('/', function(req, res, next) { // Handle GET for this route }); app.post('/', function(req, res, next) { // Handle the POST for this route }); For better security, consider limiting the allowed origin to a single domain, or adding some additional code to validate or limit the domain(s) that are allowed. Also, consider limiting sending the headers only for routes that require CORS by replacing app.all with a more specific route and method. The following code only sends the CORS headers on a GET request on the route/user, and only allows the request from http://www.localdomain.com. app.get('/user', function(req, res, next) { res.header("Access-Control-Allow-Origin", "http://www.localdomain.com"); res.header("Access-Control-Allow-Headers", "X-Requested-With"); next(); }); Since this is JavaScript code, you may dynamically manage the values of routes, methods, and domains via variables, instead of hard-coding the values. CORS npm for Express.js using Connect.js middleware Connect.js provides middleware to handle requests in Express.js. You can use Node Package Manager (npm) to install a package that enables CORS in Express.js with Connect.js: npm install cors The package offers flexible options, which should be familiar from the CORS specification, including using credentials and preflight. It provides dynamic ways to validate an origin domain using a function or a regular expression, and handler functions to process preflight. Configuration options for CORS npm origin: Configures the Access-Control-Allow-Origin CORS header with a string containing the full URL and protocol making the request, for example http://localdomain.com. Possible values for origin: Default value TRUE uses req.header('Origin') to determine the origin and CORS is enabled. When set to FALSE CORS is disabled. It can be set to a function with the request origin as the first parameter and a callback function as the second parameter. It can be a regular expression, for example /localdomain.com$/, or an array of regular expressions and/or strings to match. methods: Sets the Access-Control-Allow-Methods CORS header. Possible values for methods: A comma-delimited string of HTTP methods, for example GET, POST An array of HTTP methods, for example ['GET', 'PUT', 'POST'] allowedHeaders: Sets the Access-Control-Allow-Headers CORS header. 
Possible values for allowedHeaders: A comma-delimited string of  allowed headers, for example "Content-Type, Authorization'' An array of allowed headers, for example ['Content-Type', 'Authorization'] If unspecified, it defaults to the value specified in the request's Access-Control-Request-Headers header exposedHeaders: Sets the Access-Control-Expose-Headers header. Possible values for exposedHeaders: A comma-delimited string of exposed headers, for example 'Content-Range, X-Content-Range' An array of exposed headers, for example ['Content-Range', 'X-Content-Range'] If unspecified, no custom headers are exposed credentials: Sets the Access-Control-Allow-Credentials CORS header. Possible values for credentials: TRUE—passes the header for preflight FALSE or unspecified—omit the header, no preflight maxAge: Sets the Access-Control-Allow-Max-Age header. Possible values for maxAge An integer value in milliseconds for TTL to cache the request If unspecified, the request is not cached preflightContinue: Passes the CORS preflight response to the next handler. The default configuration without setting any values allows all origins and methods without preflight. Keep in mind that complex CORS requests other than GET, HEAD, POST will fail without preflight, so make sure you enable preflight in the configuration when using them. Without setting any values, the configuration defaults to: { "origin": "*", "methods": "GET,HEAD,PUT,PATCH,POST,DELETE", "preflightContinue": false } Code examples for CORS npm These examples demonstrate the flexibility of CORS npm for specific configurations. Note that the express and cors packages are always required. Enable CORS globally for all origins and all routes The simplest implementation of CORS npm enables CORS for all origins and all requests. The following example enables CORS for an arbitrary route " /product/:id" for a GET request by telling the entire app to use CORS for all routes: var express = require('express') , cors = require('cors') , app = express(); app.use(cors()); // this tells the app to use CORS for all re-quests and all routes app.get('/product/:id', function(req, res, next){ res.json({msg: 'CORS is enabled for all origins'}); }); app.listen(80, function(){ console.log('CORS is enabled on the web server listening on port 80'); }); Allow CORS for dynamic origins for a specific route The following example uses corsOptions to check if the domain making the request is in the whitelisted array with a callback function, which returns null if it doesn't find a match. This CORS option is passed to the route "product/:id" which is the only route that has CORS enabled. The allowed origins can be dynamic by changing the value of the variable "whitelist." 
var express = require('express')
  , cors = require('cors')
  , app = express();

// define the whitelisted domains and set the CORS options to check them
var whitelist = ['http://localdomain.com', 'http://localdomain-other.com'];
var corsOptions = {
  origin: function(origin, callback){
    var originWhitelisted = whitelist.indexOf(origin) !== -1;
    callback(null, originWhitelisted);
  }
};

// add the CORS options to a specific route /product/:id for a GET request
app.get('/product/:id', cors(corsOptions), function(req, res, next){
  res.json({msg: 'A whitelisted domain matches and CORS is enabled for route product/:id'});
});

// log that CORS is enabled on the server
app.listen(80, function(){
  console.log('CORS is enabled on the web server listening on port 80');
});

You may set different CORS options for specific routes, or sets of routes, by defining the options assigned to unique variable names, for example "corsUserOptions." Pass the specific configuration variable to each route that requires that set of options.

Enabling CORS preflight

CORS requests that use an HTTP method other than GET, HEAD, POST (for example DELETE), or that use custom headers, are considered complex and require a preflight request before proceeding with the CORS requests. Enable preflight by adding an OPTIONS handler for the route:

var express = require('express')
  , cors = require('cors')
  , app = express();

// add the OPTIONS handler
app.options('/products/:id', cors()); // options is added to the route /products/:id

// use the OPTIONS handler for the DELETE method on the route /products/:id
app.del('/products/:id', cors(), function(req, res, next){
  res.json({msg: 'CORS is enabled with preflight on the route /products/:id for the DELETE method for all origins!'});
});

app.listen(80, function(){
  console.log('CORS is enabled on the web server listening on port 80');
});

You can enable preflight globally on all routes with the wildcard:

app.options('*', cors());
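A quick way to confirm that preflight is being answered is to send the OPTIONS request by hand. This curl call is a sketch (it assumes the preflight-enabled server above is running locally on port 80); the response headers printed by -i should include the Access-Control-Allow-* values:

$ curl -i -X OPTIONS http://localhost/products/1 \
    -H "Origin: http://localdomain.com" \
    -H "Access-Control-Request-Method: DELETE"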
Configuring CORS asynchronously

One of the reasons to use Node.js frameworks is to take advantage of their asynchronous abilities, handling multiple tasks at the same time. Here we use a callback function corsDelegateOptions and add it to the cors parameter passed to the route /products/:id. The callback function can handle multiple requests asynchronously.

var express = require('express')
  , cors = require('cors')
  , app = express();

// define the allowed origins stored in a variable
var whitelist = ['http://example1.com', 'http://example2.com'];

// create the callback function
var corsDelegateOptions = function(req, callback){
  var corsOptions;
  if(whitelist.indexOf(req.header('Origin')) !== -1){
    corsOptions = { origin: true };  // the requested origin in the CORS response matches and is allowed
  }else{
    corsOptions = { origin: false }; // the requested origin in the CORS response doesn't match, and CORS is disabled for this request
  }
  callback(null, corsOptions); // callback expects two parameters: error and options
};

// add the callback function to the cors parameter for the route /products/:id for a GET request
app.get('/products/:id', cors(corsDelegateOptions), function(req, res, next){
  res.json({msg: 'A whitelisted domain matches and CORS is enabled for route product/:id'});
});

app.listen(80, function(){
  console.log('CORS is enabled on the web server listening on port 80');
});

Summary

We have learned the important aspects of applying CORS in Node.js. Let us have a quick recap of what we have learnt:

Node.js provides a web server built with JavaScript, and can be combined with many other JS frameworks as the application server.
Although some frameworks have specific syntax for implementing CORS, they all follow the CORS specification by specifying allowed origin(s) and method(s). More robust frameworks allow custom headers such as Content-Type, and preflight when required for complex CORS requests.
JavaScript frameworks may depend on the jQuery XHR object, which must be configured properly to allow Cross-Origin requests.
JavaScript frameworks are evolving rapidly. The examples here may become outdated. Always refer to the project documentation for up-to-date information.

With knowledge of the CORS specification, you may create your own techniques using JavaScript based on these examples, depending on the specific needs of your application.

https://en.wikipedia.org/wiki/Node.js

Resources for Article:

Further resources on this subject:

An Introduction to Node.js Design Patterns [article]
Five common questions for .NET/Java developers learning JavaScript and Node.js [article]
API with MongoDB and Node.js [article]

Grouping Sets in Advanced SQL

Packt
20 Jun 2017
6 min read
In this article by Hans-Jürgen Schönig, the author of the book Mastering PostgreSQL 9.6, we will learn about advanced SQL.

Introducing grouping sets

Every advanced user of SQL should be familiar with GROUP BY and HAVING clauses. But are you also aware of CUBE, ROLLUP, and GROUPING SETS? If not, this article might be worth reading for you.

Loading some sample data

To make this article a pleasant experience for you, I have compiled some sample data, which has been taken from the BP energy report at http://www.bp.com/en/global/corporate/energy-economics/statistical-review-of-world-energy.html. Here is the data structure, which will be used:

test=# CREATE TABLE t_oil (
    region text,
    country text,
    year int,
    production int,
    consumption int
);
CREATE TABLE

The test data can be downloaded from our website using curl directly:

test=# COPY t_oil FROM PROGRAM 'curl www.cybertec.at/secret/oil_ext.txt';
COPY 644

On some operating systems, curl is not there by default or has not been installed, so downloading the file beforehand might be the easier option for many people. Altogether there is data for 14 nations between 1965 and 2010, which are in two regions of the world:

test=# SELECT region, avg(production) FROM t_oil GROUP BY region;
    region     |          avg
---------------+---------------------
 Middle East   | 1992.6036866359447005
 North America | 4541.3623188405797101
(2 rows)

Applying grouping sets

The GROUP BY clause will turn many rows into one row per group. However, if you do reporting in real life, you might also be interested in the overall average. One additional line might be needed. Here is how this can be achieved:

test=# SELECT region, avg(production) FROM t_oil GROUP BY ROLLUP (region);
    region     |          avg
---------------+-----------------------
 Middle East   | 1992.6036866359447005
 North America | 4541.3623188405797101
               | 2607.5139860139860140
(3 rows)

The ROLLUP clause will inject an additional line, which will contain the overall average. If you do reporting, it is highly likely that a summary line will be needed. Instead of running two queries, PostgreSQL can provide the data running just a single query. Of course this kind of operation can also be used if you are grouping by more than just one column:

test=# SELECT region, country, avg(production)
       FROM t_oil
       WHERE country IN ('USA', 'Canada', 'Iran', 'Oman')
       GROUP BY ROLLUP (region, country);
    region     | country |          avg
---------------+---------+-----------------------
 Middle East   | Iran    | 3631.6956521739130435
 Middle East   | Oman    |  586.4545454545454545
 Middle East   |         | 2142.9111111111111111
 North America | Canada  | 2123.2173913043478261
 North America | USA     | 9141.3478260869565217
 North America |         | 5632.2826086956521739
               |         | 3906.7692307692307692
(7 rows)

In this example, PostgreSQL will inject three lines into the result set. One line will be injected for Middle East and one for North America. On top of that we will get a line for the overall averages. If you are building a web application, the current result is ideal because you can easily build a GUI to drill into the result set by filtering out the NULL values.
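If the underlying data could itself contain NULL values, filtering on NULL alone cannot distinguish a genuine group from an injected summary row. The GROUPING() function (available since PostgreSQL 9.5, not covered in this article) flags the injected rows explicitly; here is a sketch run through psql from the shell:

$ psql test -c "SELECT region, country, GROUPING(region, country) AS lvl, avg(production) FROM t_oil GROUP BY ROLLUP (region, country)"

lvl is 0 for ordinary rows and greater than 0 for the rows that ROLLUP added.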
The ROLLUP clause is nice in case you instantly want to display a result. I always used it to display final results to end users. However, if you are doing reporting, you might want to pre-calculate more data to ensure more flexibility. The CUBE keyword is what you might have been looking for:

test=# SELECT region, country, avg(production)
       FROM t_oil
       WHERE country IN ('USA', 'Canada', 'Iran', 'Oman')
       GROUP BY CUBE (region, country);
    region     | country |          avg
---------------+---------+-----------------------
 Middle East   | Iran    | 3631.6956521739130435
 Middle East   | Oman    |  586.4545454545454545
 Middle East   |         | 2142.9111111111111111
 North America | Canada  | 2123.2173913043478261
 North America | USA     | 9141.3478260869565217
 North America |         | 5632.2826086956521739
               |         | 3906.7692307692307692
               | Canada  | 2123.2173913043478261
               | Iran    | 3631.6956521739130435
               | Oman    |  586.4545454545454545
               | USA     | 9141.3478260869565217
(11 rows)

Note that even more rows have been added to the result. The CUBE clause will create the same data as: GROUP BY region, country + GROUP BY region + GROUP BY country + the overall average. So the whole idea is to extract many results and various levels of aggregation at once. The resulting cube contains all possible combinations of groups.

ROLLUP and CUBE are really just convenience features on top of GROUPING SETS. With the GROUPING SETS clause you can explicitly list the aggregates you want:

test=# SELECT region, country, avg(production)
       FROM t_oil
       WHERE country IN ('USA', 'Canada', 'Iran', 'Oman')
       GROUP BY GROUPING SETS ( (), region, country);
    region     | country |          avg
---------------+---------+-----------------------
 Middle East   |         | 2142.9111111111111111
 North America |         | 5632.2826086956521739
               |         | 3906.7692307692307692
               | Canada  | 2123.2173913043478261
               | Iran    | 3631.6956521739130435
               | Oman    |  586.4545454545454545
               | USA     | 9141.3478260869565217
(7 rows)

In this example I went for three grouping sets: the overall average, GROUP BY region, and GROUP BY country. In case you want region and country combined, use (region, country).

Investigating performance

Grouping sets are a powerful feature which helps to reduce the number of expensive queries. Internally, PostgreSQL will basically turn to traditional GroupAggregates to make things work. A GroupAggregate node requires sorted data, so be prepared that PostgreSQL might do a lot of temporary sorting:

test=# explain SELECT region, country, avg(production)
       FROM t_oil
       WHERE country IN ('USA', 'Canada', 'Iran', 'Oman')
       GROUP BY GROUPING SETS ( (), region, country);
                           QUERY PLAN
---------------------------------------------------------------
 GroupAggregate  (cost=22.58..32.69 rows=34 width=52)
   Group Key: region
   Group Key: ()
   Sort Key: country
     Group Key: country
   ->  Sort  (cost=22.58..23.04 rows=184 width=24)
         Sort Key: region
         ->  Seq Scan on t_oil  (cost=0.00..15.66 rows=184 width=24)
               Filter: (country = ANY ('{USA,Canada,Iran,Oman}'::text[]))
(9 rows)

Hash aggregates are only supported for normal GROUP BY clauses involving no grouping sets. According to the developer of grouping sets (Atri Sharma), adding support for hashes is not worth the effort, so it seems PostgreSQL already has an efficient implementation even if the optimizer has fewer choices than it has with normal GROUP BY statements.

Combining grouping sets with the FILTER clause

In real world applications, grouping sets can often be combined with so-called FILTER clauses. The idea behind FILTER is to be able to run partial aggregates.
Here is an example:

test=# SELECT region,
              avg(production) AS all,
              avg(production) FILTER (WHERE year < 1990) AS old,
              avg(production) FILTER (WHERE year >= 1990) AS new
       FROM t_oil
       GROUP BY ROLLUP (region);
    region     |      all       |      old       |      new
---------------+----------------+----------------+----------------
 Middle East   | 1992.603686635 | 1747.325892857 | 2254.233333333
 North America | 4541.362318840 | 4471.653333333 | 4624.349206349
               | 2607.513986013 | 2430.685618729 | 2801.183150183
(3 rows)

The idea here is that not all columns will use the same data for aggregation. The FILTER clauses allow you to selectively pass data to those aggregates. In my example, the second aggregate will only consider data before 1990, while the third aggregate will take care of more recent data. If it is possible to move conditions to a WHERE clause, it is always more desirable, as less data has to be fetched from the table. FILTER is only useful if the data left by the WHERE clause is not needed by each aggregate. FILTER works for all kinds of aggregates and offers a simple way to pivot your data.

Summary

We have learned about the advanced features provided by SQL. On top of simple aggregates, PostgreSQL provides grouping sets to create custom aggregates.

Resources for Article:

Further resources on this subject:

PostgreSQL in Action [article]
PostgreSQL as an Extensible RDBMS [article]
Recovery in PostgreSQL 9 [article]

article-image-analyzing-social-networks-facebook
Packt
20 Jun 2017
15 min read
Save for later

Analyzing Social Networks with Facebook

Packt
20 Jun 2017
15 min read
In this article by Raghav Bali, Dipanjan Sarkar and Tushar Sharma, the authors of the book Learning Social Media Analytics with R, we got a good flavor of the various aspects related to the most popular social micro-blogging platform, Twitter. In this article, we will look more closely at the most popular social networking platform, Facebook. With more than 1.8 billion monthly active users, over 18 billion dollars annual revenue and record breaking acquisitions for popular products including Oculus, WhatsApp and Instagram have truly made Facebook the core of the social media network today. (For more resources related to this topic, see here.) Before we put Facebook data under the microscope, let us briefly look at Facebook’s interesting origins! Like many popular products, businesses and organizations, Facebook too had a humble beginning. Originally starting off as Mark Zuckerberg’s brainchild in 2004, it was initially known as “Thefacebook” located at thefacebook.com, which was branded as an online social network, connecting university and college students. While this social network was only open to Harvard students in the beginning, it soon expanded within a month by including students from other popular universities. In 2005, the domain facebook.com was finally purchased and “Facebook” extended its membership to employees of companies and organizations for the first time. Finally in 2006, Facebook was finally opened to everyone above 13 years of age and having a valid email address. The following snapshot shows us how the look and feel of the Facebook platform has evolved over the years! Facebook’s evolving look over time While Facebook has a primary website, also known as a web application, it has also launched mobile applications for the major operating systems on handheld devices. In short, Facebook is not just a social network website but an entire platform including a huge social network of connected people and organizations through friends, followers and pages. We will leverage Facebook’s social “Graph API” to access actual Facebook data to perform various analyses. Users, brands, business, news channels, media houses, retail stores and many more are using Facebook actively on a daily basis for producing and consuming content. This generates vast amount of data and a substantial amount of this is available to users through its APIs.  From a social media analytics perspective, this is really exciting because this treasure trove of data with easy to access APIs and powerful open source libraries from R, gives us enormous potential and opportunity to get valuable information from analyzing this data in various ways. We will follow a structured path in this article and cover the following major topics sequentially to ensure that you do not get overwhelmed with too much content at once. Accessing Facebook data Analyzing your personal social network Analyzing an English football social network Analyzing English football clubs’ brand page engagements We will use libraries like Rfacebook, igraph and ggplot2 to retrieve, analyze and visualize data from Facebook. All the following sections of the book assume that you have a Facebook account which is necessary to access data from the APIs and analyze it. In case you do not have an account, do not despair. You can use the data and code files for this article to follow along with the hands-on examples to gain a better understanding of the concepts of social network and engagement analysis.    
Accessing Facebook data You will find a lot of content in several books and on the web about various techniques to access and retrieve data from Facebook. There are several official ways of doing this which include using the Facebook Graph API either directly through low level HTTP based calls or indirectly through higher level abstract interfaces belonging to libraries like Rfacebook. Some alternate ways of retrieving Facebook data would be to use registered applications on Facebook like Netvizz or the GetNet application built by Lada Adamic, used in her very popular “Social Network Analysis” course (Unfortunately http://snacourse.com/getnet is not working since Facebook completely changed its API access permissions and privacy settings). Unofficial ways include techniques like web scraping and crawling to extract data. Do note though that Facebook considers this to be a violation of its terms and conditions of accessing data and you should try and avoid crawling Facebook for data especially if you plan to use it for commercial purposes. In this section, we will take a closer look at the Graph API and the Rfacebook package in R. The main focus will be on how you can extract data from Facebook using both of them. Understanding the Graph API To start using the Graph API, you would need to have an account on Facebook to be able to use the API. You can access the API in various ways. You can create an application on Facebook by going to https://developers.facebook.com/apps/ and then create a long-lived OAuth access token using the fbOAuth(…)function from the Rfacebook package. This enables R to make calls to the Graph API and you can also store this token on the disk and load it for future use. An easier way is to create a short-lived token which would let you access the API data for about two hours by going to the Facebook Graph API Explorer page which is available at https://developers.facebook.com/tools/explorer and get a temporary access token from there. The following snapshot depicts how to get an access token for the Graph API from Facebook. Facebook’s Graph API explorer On clicking “Get User Access Token” in the above snapshot, it will present a list of checkboxes with various permissions which you might need for accessing data including user data permissions, events, groups and pages and other miscellaneous permissions. You can select the ones you need and click on the “Get Access Token” button in the prompt. This will generate a new access token the field depicted in the above snapshot and you can directly copy and use it to retrieve data in R. Before going into that, we will take a closer look at the Graph API explorer which directly allows you to access the API from your web browser itself and helps if you want to do some quick exploratory analysis. A part of it is depicted in the above snapshot. The current version of the API when writing this book is v2.8 which you can see in the snapshot beside the GET resource call. Interestingly, the Graph API is so named because Facebook by itself can be considered as a huge social graph where all the information can be classified into the following three categories. Nodes: These are basically users, pages, photos and so on. Nodes indicate a focal point of interest which is connected to other points. Edges: These connect various nodes together forming the core social graph and these connections are based on various relations like friends, followers and so on. 
Fields: These are specific attributes or properties about nodes, an example would be a user’s address, birthday, name and so on. Like we mentioned before, the API is HTTP based and you can make HTTPGET requests to nodes or edges and all requests are passed to graph.facebook.com to get data. Each node usually has a specific identifier and you can use it for querying information about a node as depicted in the following snippet. GET graph.facebook.com /{node-id} You can also use edge names in addition to the identifier to get information about the edges of the node. The following snippet depicts how you can do the same. GET graph.facebook.com /{node-id}/{edge-name} The following snapshot shows us how we can get information about our own profile. Querying your details in the Graph API explorer Now suppose, I wanted to retrieve information about a Facebook page,“Premier League” which represents the top tier competition in English Football using its identifier and also take a look at its liked pages. I can do the same using the following request. Querying information about a Facebook Page using the Graph API explorer Thus from the above figure, you can clearly see the node identifier, page name and likes for the page, “Premier League”. It must be clear by now that all API responses are returned in the very popular JSON format which is easy to parse and format as needed for analysis. Besides this, there also used to be another way of querying the social graph in Facebook, which was known as FQL or Facebook Query Language, an SQL like interface for querying and retrieving data. Unfortunately, Facebook seems to have deprecated its use and hence covering it would be out of our present scope. Now that you have a firm grasp on the syntax of the Graph API and have also seen a few examples of how to retrieve data from Facebook, we will take a closer look at the Rfacebook package. Understanding Rfacebook Since we will be accessing and analyzing data from Facebook using R, it makes sense to have some robust mechanism to directly query Facebook and retrieve data instead of going to the browser every time like we did in the earlier section. Fortunately, there is an excellent package in R called Rfacebook which has been developed by Pablo Barberá. You can either install it from CRAN or get its most updated version from GitHub. The following snippet depicts how you can do the same. Remember you might need to install the devtools package if you don’t have it already, to download and install the latest version of the Rfacebook package from GitHub. install.packages("Rfacebook") # install from CRAN # install from GitHub library(devtools) install_github("pablobarbera/Rfacebook/Rfacebook") Once you install the package, you can load up the package using load(Rfacebook) and start using it to retrieve data from Facebook by using the access token you generated earlier. The following snippet shows us how you can access your own details like we had mentioned in the previous section, but this time by using R. > token = 'XXXXXX' > me <- getUsers("me", token=token) > me$name [1] "Dipanjan Sarkar" > me$id [1] "1026544" The beauty of this package is that you directly get the results in curated and neatly formatted data frames and you do not need to spend extra time trying to parse the raw JSON response objects from the Graph API. The package is well documented and has high level functions for accessing personal profile data on Facebook as well as page and group level data points. 
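For instance, a hedged sketch of pulling recent posts from a public page with one of these higher-level helpers might look like the following (the page name and the number of posts are placeholders, and the snippet assumes the getPage() function from Rfacebook together with the access token created earlier):

library(Rfacebook)
# retrieve the 50 most recent posts from a public Facebook page
page_posts <- getPage("PremierLeague", token = token, n = 50)
# inspect the post text, creation time and engagement counts
head(page_posts[c("message", "created_time", "likes_count", "comments_count")])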
We will now take a quick look at Netvizz a Facebook application, which can also be used to extract data easily from Facebook. Understanding Netvizz The Netvizz application was developed by Bernhard Rieder and is a tool which can be used to extract data from Facebook pages, groups, get statistics about links and also extract social networks from Facebook pages based on liked pages from each connected page in the network. You can access Netvizz at https://apps.facebook.com/netvizz/ and on registering the application on your profile, you will be able to see the following screen. The Netvizz application interface From the above app snapshot, you can see that there are various links based on the type of operation you want to execute to extract data. Feel free to play around with this tool and we will be using its “page like network” capability later on in one of our analyses in a future section. Data Access Challenges There are several challenges with regards to accessing data from Facebook. Some of the major issues and caveats have been mentioned in the following points: Facebook will keep evolving and updating its data access APIs and this can and will lead to changes and deprecation of older APIs and access patterns just like FQL was deprecated. Scope of data available keeps changing with time and evolving of Facebook’s API and privacy settings. For instance we can no longer get details of all our friends from the API any longer. Libraries and Tools built on top of the API can tend to break with changes to Facebook’s APIs and this has happened before with Rfacebook as well as Netvizz. Besides this, Lada Adamic’s GetNet application has stopped working permanently ever since Facebook changed the way apps are created and the permissions they require. You can get more information about it here http://thepoliticsofsystems.net/2015/01/the-end-of-netvizz/ Thus what was used in the book today for data retrieval might not be working completely tomorrow if there are any changes in the APIs though it is expected it will be working fine for at least the next couple of years. However to prevent any hindrance on analyzing Facebook data, we have provided the datasets we used in most of our analyses except personal networks so that you can still follow along with each example and use-case. Personal names have been anonymized wherever possible to protect their privacy. Now that we have a good idea about Facebook’s Graph API and how to access data, let’s analyze some social networks! Analyzing your personal social network Like we had mentioned before, Facebook by itself is a massive social graph, connecting billions of users, brands and organization. Consider your own Facebook account if you have one. You will have several friends which are your immediate connections, they in turn will be having their own set of friends including you and you might be friends with some of them and so on. Thus you and your friends form the nodes of the network and edges determine the connections. In this section we will analyze a small network of you and your immediate friends and also look at how we can extract and analyze some properties from the network. Before we jump into our analysis, we will start by loading the necessary packages needed which are mentioned in the following snippet and storing the Facebook Graph API access token in a variable. 
library(Rfacebook) library(gridExtra) library(dplyr) # get the Graph API access token token = ‘XXXXXXXXXXX’ You can refer to the file fb_personal_network_analysis.R for code snippets used in the examples depicted in this section. Basic descriptive statistics In this section, we will try to get some basic information and descriptive statistics on the same from our personal social network on Facebook. To start with let us look at some details of our own profile on Facebook using the following code. # get my personal information me <- getUsers("me", token=token, private_info = TRUE) > View(me[c('name', 'id', 'gender', 'birthday')]) This shows us a few fields from the data frame containing our personal details retrieved from Facebook. We use the View function which basically invokes a spreadsheet-style data viewer on R objects like data frames. Now, let us get information about our friends in our personal network. Do note that Facebook currently only lets you access information about those friends who have allowed access to the Graph API and hence you may not be able to get information pertaining to all friends in your friend list. We have anonymized their names below for privacy reasons. anonymous_names <- c('Johnny Juel', 'Houston Tancredi',..., 'Julius Henrichs', 'Yong Sprayberry') # getting friends information friends <- getFriends(token, simplify=TRUE) friends$name <- anonymous_names # view top few rows > View(head(friends)) This gives us a peek at some people from our list of friends which we just retrieved from Facebook. Let’s now analyze some descriptive statistics based on personal information regarding our friends like where they are from, their gender and so on. # get personal information friends_info <- getUsers(friends$id, token, private_info = TRUE) # get the gender of your friends >View(table(friends_info$gender)) This gives us the gender of my friends, looks like more male friends have authorized access to the Graph API in my network! # get the location of your friends >View(table(friends_info$location)) This depicts the location of my friends (wherever available) in the following data frame. # get relationship status of your friends > View(table(friends_info$relationship_status)) From the statistics in the following table I can see that a lot of my friends have gotten married over the past couple of years. Boy that does make me feel old! Suppose I want to look at the relationship status of my friends grouped by gender, we can do the same using the following snippet. # get relationship status of friends grouped by gender View(table(friends_info$relationship_status, friends_info$gender)) The following table gives us the desired results and you can see the distribution of friends by their gender and relationship status. Summary This article has been proven very beneficial to know some basic analytics of social networks with the help of R. Moreover, you will also get to know the information regarding the packages that R use. Resources for Article: Further resources on this subject: How to integrate social media with your WordPress website [article] Social Media Insight Using Naive Bayes [article] Social Media in Magento [article]
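Since ggplot2 is one of the libraries we said we would use for visualization, a small, hedged sketch of charting this distribution could look like the following (it assumes the friends_info data frame built above and simply drops friends whose relationship status is not available):

library(ggplot2)
# bar chart of relationship status, split by gender
plot_data <- subset(friends_info, !is.na(relationship_status))
ggplot(plot_data, aes(x = relationship_status, fill = gender)) +
  geom_bar(position = "dodge") +
  labs(x = "Relationship status", y = "Number of friends") +
  theme_minimal()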

article-image-monitoring-logging-and-troubleshooting
Packt
20 Jun 2017
6 min read
Save for later

Monitoring, Logging, and Troubleshooting

Packt
20 Jun 2017
6 min read
In this article by Gigi Sayfan, the author of the book Mastering Kubernetes, we will learn how to monitor Kubernetes with Heapster. (For more resources related to this topic, see here.)
Monitoring Kubernetes with Heapster
Heapster is a Kubernetes project that provides a robust monitoring solution for Kubernetes clusters. It runs as a pod (of course), so it can be managed by Kubernetes itself. Heapster supports Kubernetes and CoreOS clusters. It has a very modular and flexible design. Heapster collects both operational metrics and events from every node in the cluster, stores them in a persistent backend (with a well-defined schema) and allows visualization and programmatic access. Heapster can be configured to use different backends (or sinks, in Heapster's parlance) and their corresponding visualization frontends. The most common combination is InfluxDB as the backend and Grafana as the frontend. The Google Cloud platform integrates Heapster with the Google monitoring service. There are many other, less common backends, such as the following:
Log
InfluxDB
Google Cloud monitoring
Google Cloud logging
Hawkular-Metrics (metrics only)
OpenTSDB
Monasca (metrics only)
Kafka (metrics only)
Riemann (metrics only)
Elasticsearch
You can use multiple backends by specifying sinks on the command line:
--sink=log --sink=influxdb:http://monitoring-influxdb:80/
cAdvisor
cAdvisor is part of the kubelet, which runs on every node. It collects information about the CPU/cores usage, memory, network, and file systems of each container. It provides a basic UI on port 4194 but, most importantly for Heapster, it provides all this information through the kubelet. Heapster records the information collected by cAdvisor on each node and stores it in its backend for analysis and visualization. The cAdvisor UI is useful if you want to quickly verify that a particular node is set up correctly, for example, while creating a new cluster when Heapster is not hooked up yet.
InfluxDB backend
InfluxDB is a modern and robust distributed time-series database. It is very well suited to, and broadly used for, centralized metrics and logging. It is also the preferred Heapster backend (outside the Google Cloud platform). The only caveat is that InfluxDB clustering and high availability are part of the enterprise offering.
The storage schema
The InfluxDB storage schema defines the information that Heapster stores in InfluxDB and makes available for querying and graphing later. The metrics are divided into multiple categories, called measurements. You can treat and query each metric separately, or you can query a whole category as one measurement and receive the individual metrics as fields. The naming convention is <category>/<metrics name> (except for uptime, which has a single metric). If you have a SQL background, you can think of measurements as tables. Metrics are stored per container.
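To make the measurement idea concrete, here is a hedged sketch of what a query against the k8s database could look like once Heapster is writing to InfluxDB (the value field and the pod_name tag follow the default Heapster sink schema described in this section; adjust the names if your deployment differs):

SELECT mean(value)
FROM "cpu/usage_rate"
WHERE time > now() - 1h
GROUP BY time(5m), pod_name

This would return the average CPU usage per pod in five-minute buckets over the last hour. The tags available for this kind of filtering and grouping are listed next.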
Each metric is labeled with the following information: pod_id – Unique ID of a pod pod_name – User-provided name of a pod pod_namespace – The namespace of a pod container_base_image – Base image for the container container_name – User-provided name of the container or full cgroup name for system containers host_id – Cloud-provider-specified or user-specified Identifier of a node hostname – Hostname where the container ran labels – Comma-separated list of user-provided labels; format is key:value’ namespace_id – UID of the namespace of a pod resource_id – A unique identifier used to differentiate multiple metrics of the same type, for example, FS partitions under filesystem/usage Here are all the metrics grouped by category. As you can see, it is quite extensive. CPU cpu/limit – CPU hard limit in millicores cpu/node_capacity – CPU capacity of a node cpu/node_allocatable – CPU allocatable of a node cpu/node_reservation – Share of CPU that is reserved on the node allocatable cpu/node_utilization – CPU utilization as a share of node allocatable cpu/request – CPU request (the guaranteed amount of resources) in millicores cpu/usage – Cumulative CPU usage on all cores cpu/usage_rate – CPU usage on all cores in millicores File system filesystem/usage – Total number of bytes consumed on a filesystem filesystem/limit – The total size of the filesystem in bytes filesystem/available – The number of available bytes remaining in the filesystem Memory memory/limit – Memory hard limit in bytes memory/major_page_faults – Number of major page faults memory/major_page_faults_rate – Number of major page faults per second memory/node_capacity – Memory capacity of a node memory/node_allocatable – Memory allocatable of a node memory/node_reservation – Share of memory that is reserved on the node allocatable memory/node_utilization – Memory utilization as a share of memory allocatable memory/page_faults – Number of page faults memory/page_faults_rate – Number of page faults per second memory/request – Memory request (the guaranteed amount of resources) in bytes memory/usage – Total memory usage memory/working_set – Total working set usage; working set is the memory being used and not easily dropped by the kernel Network network/rx – Cumulative number of bytes received over the network network/rx_errors – Cumulative number of errors while receiving over the network network/rx_errors_rate – Number of errors per second while receiving over the network network/rx_rate – Number of bytes received over the network per second network/tx – Cumulative number of bytes sent over the network network/tx_errors – Cumulative number of errors while sending over the network network/tx_errors_rate – Number of errors while sending over the network network/tx_rate – Number of bytes sent over the network per second Uptime uptime – Number of milliseconds since the container was started You can work with InfluxDB directly if you’re familiar with it. You can either connect to it using its own API or use its web interface. Type the following command to find its port: k describe service monitoring-influxdb --namespace=kube-system | grep NodePort Type: NodePort NodePort: http 32699/TCP NodePort: api 30020/TCP Now you can browse to the InfluxDB web interface using the HTTP port. You’ll need to configure it to point to the API port. The username and password are root and root by default: Once you’re setup you can select what database to use (see top-right corner). The Kubernetes database is called k8s. 
You can now query the metrics using the InfluxDB query language.
Grafana visualization
Grafana runs in its own container and serves a sophisticated dashboard that works well with InfluxDB as a data source. To locate the port, type the following command:
k describe service monitoring-grafana --namespace=kube-system | grep NodePort
Type: NodePort
NodePort: <unset> 30763/TCP
Now you can access the Grafana web interface on that port. The first thing you need to do is set up the data source to point to the InfluxDB backend:
Make sure to test the connection and then go explore the various options in the dashboards. There are several default dashboards, but you should be able to customize them to your preferences. Grafana is designed to let you adapt it to your needs.
Summary
In this article, we have learned how to monitor Kubernetes with Heapster.
Resources for Article:
Further resources on this subject:
The Microsoft Azure Stack Architecture [article]
Building A Recommendation System with Azure [article]
Setting up a Kubernetes Cluster [article]
article-image-introduction-nfrs
Packt
20 Jun 2017
14 min read
Save for later

Introduction to NFRs

Packt
20 Jun 2017
14 min read
In this article by Sameer Paradkar, the author of the book Mastering Non-Functional Requirements, we will learn the non-functional requirements are those aspects of the IT system that, while not directly affect the business functionality of the application but have a profound impact on the efficiency and effectiveness of business systems for end users as well as the people responsible for supporting the program. The definition of these requirements is an essential factor in developing a total customer solution that delivers business goals. Non-functional requirements are used primarily to drive the operational aspects of the architecture, in other words, to address major operational and technical areas of the system to ensure the robustness and ruggedness of the application. Benchmark or Proof-of-Concept can be used to verify if the implementation meets these requirements or indicate if a corrective action is necessary. Ideally, a series of tests should be planned that maps to the development schedule and grows in complexity. The topics that are covered in this article are as follows: Definition of NFRs NFR KPIs and metrics (For more resources related to this topic, see here.) Introducing NFR The following pointers state the definition of NFR: To define requirements and constraints on the IT system As a basis for cost estimates and early system sizing To assess the viability of the proposed IT system NFRs are an important determining factor of the architecture and design of the operational models As an guideline to design phase to meet NFRs such as performance, scalability, availability The NFRs foreach of the domains e.g. scalability, availability and so on,must be understood to facilitate the design and development of the target operating model. These include the servers, networks, and platforms including the application runtime environments. These are critical for the execution of benchmark tests. They also affect the design of technical and application components. End users have expectations about the effectiveness of the application. These characteristics include ease of software use, speed, reliability, and recoverability when unexpected conditions arise. The NFRs define these aspects of the IT system. The non-functional requirements should be defined precisely and involves quantifying them. NFRs should provide measurements the application must meet. For example, the maximum number of time allowed to execute a process, the number of hours in a day an application must be available, the maximum size of a database on disk, and the number of concurrent users supported are typical NFRs the software must implement. Figure 1: Key Non-Functional Requirements There are many kinds of non-functional requirements, including: Performance Performance is the responsiveness of the application to perform specific actions in a given time span. Performance is scored in terms of throughput or latency. Latency is the time taken by the application to respond to an event. Throughput is the number of events scored in a given time interval. An application’s performance can directly impact its scalability. Enhancing application’s performance often enhances scalability by reducing contention for shared resources. Performance attributes specify the timing characteristics of the application. Certain features are more time-sensitive than others; the NFRs should identify such software tasks that have constraints on their performance. 
Response time relates to the time needed to complete specific business processes, batch or interactive, within the target business system. The system must be designed to fulfil the agreed upon response time requirements, while supporting the defined workload mapped against the given static baseline, on a system platform that does not exceed the stated utilization. The following attributes are: Throughput: The ability of the system to execute a given number of transactions within a given unit of time Response times: The distribution of time which the system takes to respond to the request Scalability Scalability is the ability to handle an increase in the work load without impacting the performance, or the ability to quickly expand the architecture. Itis the ability to expand the architecture to accommodate more users, more processes, more transactions, additional systems and services as the business requirements change and the systems evolve to meet the future business demands. This permits existing systems to be extended without replacing them. Thisdirectly affects the architecture and the selection of software components and hardware. The solution must allow the hardware and the deployed software services and components to be scaled horizontally as well as vertically. Horizontal scaling involves replicating the same functionality across additional nodes vertical scaling involves the same functionality across bigger and more powerful nodes. Scalability definitions measure volumes of users and data the system should support. There are two key techniques for improving both vertical and horizontal scalability. Vertical Scaling is also known as scaling up and includes adding more resources such as memory, CPUand hard disk to a system. Horizontal scaling is also know as scaling out and includes adding more nodes to a cluster forwork load sharing. The following attributes are: Throughput: Number of maximum transactions your system needs to handle. E.g., thousand a day or A million Storage: Amount  of data you going to need to store Growth requirements: Data growth in the next 3-5 years Availability Availability is the time frame in which the system functions normally and without failures. Availability is measured as the percentage of total application downtime over a defined time period. Availability is affected by failures, exceptions, infrastructure issues, malicious attacks, and maintenance and upgrades. It is the uptime or the amount of time the system is operational and available for use. This is specified because some systems are architected with expected downtime for activities like database upgrades and backups. Availability also conveys the number of hours or days per week or weeks per year the application will be available to its end customers, as well as how rapidly it can recover from faults. Since the architecture establishes software, hardware, and networking entities, this requirement extends to all of them. Hardware availability, recoverability, and reliability definitions measure system up-time. For example, it is specified in terms of mean time between failures or “MTBF”. The following attributes are: Availability: Application availability considering the weekends, holidays and maintenance times and failures. Locations of operation: Geographic location, Connection requirements and the restrictions of the network prevail. Offline Requirement: Time available for offline operations including batch processing & system maintenance. 
Length of time between failures
Recoverability: Time required by the system to resume operation in the event of a failure
Resilience: The reliability characteristics of the system and its sub-components
Capacity
This non-functional requirement defines the ways in which the system is expected to scale up by increasing capacity, adding hardware or adding machines, based on business objectives. Capacity is about delivering enough functionality for the end users. A web service asked to provide 1,000 requests per second when the server is only capable of 100 requests a second may not succeed. While this sounds like an availability issue, it occurs because the server is unable to handle the requisite capacity. A single node may not be able to provide enough capacity, and one may need to deploy multiple nodes with a similar configuration to meet organizational capacity requirements. The capability to identify a failing node and restart its workload on another machine or VM is also a non-functional requirement. The following attributes are (a short worked sizing example follows this list):
Throughput: The number of peak transactions the system needs to handle
Storage: Volume of data the system can persist to disk at run time; relates to memory and disk
Year-on-year growth requirements (users, processing and storage) and e-channel growth projections
Different types of transactions or activities supported
For each type of transaction, volumes on an hourly, daily, weekly and monthly basis
Whether volumes are significantly higher during specific times of the day (for example, at lunch), week, month or year
Expected transaction volume growth and the additional volumes you will be able to handle
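As a quick worked example of how such capacity figures translate into a concrete throughput target (all the volumes here are purely hypothetical): suppose the system must handle 1,000,000 transactions per day and roughly 80% of that traffic arrives within a 4-hour peak window. The peak load is then about 800,000 transactions / (4 x 3,600 seconds) ≈ 56 transactions per second on average, and sizing the platform with additional headroom above that figure (for example, two to three times the average peak) is a common way to absorb short bursts within the window.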
These changes may impact any of the components, services, functionality, or interfaces in the application landscape while modifying to fix errors, or to meet changing business requirements. This is also a degree of time it takes to restore the system to its normal state following a failure or fault. Improving maintainability can improve the availability and reduce the run-time defects. Application’s maintainability is dependent on the overall quality attributes. It is critical as a large chunk of the IT budget is spent on maintenance of systems. The more maintainable a system is the lower the total cost of ownership. The following attributes are: Conformance to design standards, coding standards, best practices, reference architectures, and frameworks. Flexibility: The degree to which the system is intended to support change Release support: The way in which the system supports the introduction of initial release, phased rollouts and future releases Manageability Manageability is the ease with which the administrators can manage the application, through useful instrumentation exposed for monitoring. It is the ability of the system or the group of the system to provide key information to the operations and support team to be able to debug, analyze and understand the root cause of failures. It deals with compliance/governance with the domain frameworks and polices. The key is to design the application that is easy to manage, by exposing useful instrumentation for monitoring systems and for understanding the cause of failures. The following attributes are: System must maintain total traceability of transactions Businessobjectsand database fields are part of auditing User and transactional timestamps. File characteristics include size before, size after and structure Getting events and alerts as thresholds (for example, memory, storage, processor) are breached Remotely manage applications and create new virtual instances at the click of a button Rich graphical dashboard for all key applications metrics and KPI Reliability Reliability is the ability of the application to maintain its integrity and veracity over a time span and also in the event of faults or exceptions. It is measured as the probability that the software will not fail and that it will continue functioning for a defined time interval. It alsospecifies the ability of the system to maintain its performance over a time span. Unreliable software is prone to failures anda few processes may be more sensitive to failure than others, because such processes may not be able to recover from a fault or exceptions. The following attributes are: The characteristic of a system to perform its functions under stated conditions for a specificperiod of time. Mean Time To Recovery: Time is available to get the system back up online. Mean Time Between Failures – Acceptable threshold for downtime Data integrity is also known as referential integrity in database tables and interfaces Application Integrity and Information Integrity: during transactions Fault trapping (I/O): Handling failures and recovery Extensibility Extensibility is the ability of a system to cater to future changes through flexible architecture, design or implementation. Extensible applications have excellent endurance, which prevents the expensive processes of procuring large inflexible applications and retiring them due to changes in business needs. Extensibility enables organizations to take advantage of opportunities and respond to risks. 
While there is a significant difference extensibility is often tangled with modifiability quality. Modifiability means that is possible to change the software whereas extensibility means that change has been planned and will be effortless. Adaptability is at times erroneously leveraged with extensibility. However, adaptability deals with how the user interactions with the system are managed and governed. Extensibilityallows a system, people, technology, information, and processes all working together to achieve following objectives: The following attributes are: Handle new information types Manage new or changed business entities Consume or provide new feeds Recovery In the event of a natural calamity for example, flood or hurricane, the entire facility where the application is hosted may become inoperable or inaccessible. Business-critical applications should have a strategy to recover from such disasters within a reasonable amount of time frame. The solution implementing various processes must be integrated with the existing enterprise disaster recovery plan. The processes must be analysed to understand the criticality of each process to the business, the impact of loss to the business in case of non-availability of the process. Based on this analysis, appropriate disaster procedures must be developed, and plans should be outlined. As part of disaster recovery, electronic backups of data and procedures must be maintained at the recovery location and be retrievable within the appropriate time frames for system function restoration. In the case of high criticality, real-time mirroring to a mirror site should be deployed. The following attributes are: Recoveryprocess: Recovery Time Objectives(RTO) / Recovery Point Objectives(RPO) Restore time: Time required switching to the secondary site when the primary fails RPO/Backup time: Time it takes to back your data Backup frequencies: Frequency of backing-up the transaction data, configuration data and code Interoperability Interoperability is the ability to exchange information and communicate with internal and external applications and systems. Interoperable systems make it easier to exchange information both internally and externally. The data formats, transport protocols and interfaces are the key attributes for architecting interoperable systems. Standardization of data formats, transport protocols and interfaces are the key aspect to be considered when architecting interoperable system. Interoperability is achieved through: Publishing and describing interfaces Describing the syntax used to communicate Describing the semantics of information it produces and consumes Leveraging open standards to communicate with external systems Loosely coupled with external systems The following attributes are: Compatibility with shared applications: Other system it needs to integrate Compatibility with 3rd party applications: Other systems it has to live with amicably Compatibility with various OS: Different OS compatibility Compatibility on different platforms: Hardware platforms it needs to work on Usability Usability measures characteristics such as consistency and aesthetics in the user interface. Consistency is the constant use of mechanisms employed in the user interface while Aesthetics refers to the artistic, visual quality of the user interface. It is the ease at which the users operate the system and make productive use of it. 
Usability is discussed with relation to the system interfaces, but it can just as well be applied to any tool, device, or rich system. This addresses the factors that establish the ability of the software to be understood, used, and learned by its intended users. The application interfaces must be designed with end users in mind so that they are intuitive to use, are localized, provide access for differently abled users, and provide an excellent overall user experience. The following attributes are: Look and feel standards: Layout and flow, screen element density, keyboard shortcuts, UI metaphors, colors. Localization/Internationalization requirements: Keyboards, paper sizes, languages, spellings, and so on Summary It explains he introduction of NFRs and why NFRs are a critical for building software systems. The article also explained various KPI for each of the key of NFRs i.e. scalability, availability, reliability and do on.  Resources for Article: Further resources on this subject: Software Documentation with Trac [article] The Software Task Management Tool - Rake [article] Installing Software and Updates [article]

article-image-understanding-basics-rxjava
Packt
20 Jun 2017
15 min read
Save for later

Understanding the Basics of RxJava

Packt
20 Jun 2017
15 min read
In this article, by Tadas Subonis author of the book Reactive Android Programming, will go through the core basics of RxJava so that we can fully understand what it is, what are the core elements, and how they work. Before that, let's take a step back and briefly discuss how RxJava is different from other approaches. RxJava is about reacting to results. It might be an item that originated from some source. It can also be an error. RxJava provides a framework to handle these items in a reactive way and to create complicated manipulation and handling schemes in a very easy-to-use interface. Things like waiting for an arrival of an item before transforming it become very easy with RxJava. To achieve all this, RxJava provides some basic primitives: Observables: A source of data Subscriptions: An activated handle to the Observable that receives data Schedulers: A means to define where (on which Thread) the data is processed First of all, we will cover Observables--the source of all the data and the core structure/class that we will be working with. We will explore how are they related to Disposables (Subscriptions). Furthermore, the life cycle and hook points of an Observable will be described, so we will actually know what's happening when an item travels through an Observable and what are the different stages that we can tap into. Finally, we will briefly introduce Flowable--a big brother of Observable that lets you handle big amounts of data with high rates of publishing. To summarize, we will cover these aspects: What is an Observable? What are Disposables (formerly Subscriptions)? How items travel through the Observable? What is backpressure and how we can use it with Flowable? Let's dive into it! (For more resources related to this topic, see here.) Observables Everything starts with an Observable. It's a source of data that you can observe for emitted data (hence the name). In almost all cases, you will be working with the Observable class. It is possible to (and we will!) combine different Observables into one Observable. Basically, it is a universal interface to tap into data streams in a reactive way. There are lots of different ways of how one can create Observables. The simplest way is to use the .just() method like we did before: Observable.just("First item", "Second item"); It is usually a perfect way to glue non-Rx-like parts of the code to Rx compatible flow. When an Observable is created, it is not usually defined when it will start emitting data. If it was created using simple tools such as.just(), it won't start emitting data until there is a subscription to the observable. How do you create a subscription? It's done by calling .subscribe() : Observable.just("First item", "Second item") .subscribe(); Usually (but not always), the observable be activated the moment somebody subscribes to it. So, if a new Observable was just created, it won't magically start sending data "somewhere". Hot and Cold Observables Quite often, in the literature and documentation terms, Hot and Cold Observables can be found. Cold Observable is the most common Observable type. For example, it can be created with the following code: Observable.just("First item", "Second item") .subscribe(); Cold Observable means that the items won't be emitted by the Observable until there is a Subscriber. This means that before the .subscribe() is called, no items will be produced and thus none of the items that are intended to be omitted will be missed, everything will be processed. 
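A minimal sketch of this cold behaviour (the log tag and the items are purely illustrative) could look like this:

// nothing is emitted yet - the Observable is only defined here
Observable<String> coldObservable = Observable.just("First item", "Second item");

// emission starts only at this point, once a Subscriber appears
coldObservable.subscribe(item -> Log.d("APP", "received: " + item));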
Hot Observable is an Observable that will begin producing (emitting) items internally as soon as it is created. The status updates are produced constantly and it doesn't matter if there is something that is ready to receive them (like Subscription). If there were no subscriptions to the Observable, it means that the updates will be lost. Disposables A disposable (previously called Subscription in RxJava 1.0) is a tool that can be used to control the life cycle of an Observable. If the stream of data that the Observable is producing is boundless, it means that it will stay active forever. It might not be a problem for a server-side application, but it can cause some serious trouble on Android. Usually, this is the common source of memory leaks. Obtaining a reference to a disposable is pretty simple: Disposable disposable = Observable.just("First item", "Second item") .subscribe(); Disposable is a very simple interface. It has only two methods: dispose() and isDisposed() .  dispose() can be used to cancel the existing Disposable (Subscription). This will stop the call of .subscribe()to receive any further items from Observable, and the Observable itself will be cleaned up. isDisposed() has a pretty straightforward function--it checks whether the subscription is still active. However, it is not used very often in regular code as the subscriptions are usually unsubscribed and forgotten. The disposed subscriptions (Disposables) cannot be re-enabled. They can only be created anew. Finally, Disposables can be grouped using CompositeDisposable like this: Disposable disposable = new CompositeDisposable( Observable.just("First item", "Second item").subscribe(), Observable.just("1", "2").subscribe(), Observable.just("One", "Two").subscribe() ); It's useful in the cases when there are many Observables that should be canceled at the same time, for example, an Activity being destroyed. Schedulers As described in the documentation, a scheduler is something that can schedule a unit of work to be executed now or later. In practice, it means that Schedulers control where the code will actually be executed and usually that means selecting some kind of specific thread. Most often, Subscribers are used to executing long-running tasks on some background thread so that it wouldn't block the main computation or UI thread. This is especially relevant on Android when all long-running tasks must not be executed on MainThread. Schedulers can be set with a simple .subscribeOn() call: Observable.just("First item", "Second item") .subscribeOn(Schedulers.io()) .subscribe(); There are only a few main Schedulers that are commonly used: Schedulers.io() Schedulers.computation() Schedulers.newThread() AndroidSchedulers.mainThread() The AndroidSchedulers.mainThread() is only used on Android systems. Scheduling examples Let's explore how schedulers work by checking out a few examples. 
Let's run the following code: Observable.just("First item", "Second item") .doOnNext(e -> Log.d("APP", "on-next:" + Thread.currentThread().getName() + ":" + e)) .subscribe(e -> Log.d("APP", "subscribe:" + Thread.currentThread().getName() + ":" + e)); The output will be as follows: on-next:main:First item subscribe:main:First item on-next:main:Second item subscribe:main:Second item Now let's try changing the code to as shown: Observable.just("First item", "Second item") .subscribeOn(Schedulers.io()) .doOnNext(e -> Log.d("APP", "on-next:" + Thread.currentThread().getName() + ":" + e)) .subscribe(e -> Log.d("APP", "subscribe:" + Thread.currentThread().getName() + ":" + e)); Now, the output should look like this: on-next:RxCachedThreadScheduler-1:First item subscribe:RxCachedThreadScheduler-1:First item on-next:RxCachedThreadScheduler-1:Second item subscribe:RxCachedThreadScheduler-1:Second item We can see how the code was executed on the main thread in the first case and on a new thread in the next. Android requires that all UI modifications should be done on the main thread. So, how can we execute a long-running process in the background but process the result on the main thread? That can be done with .observeOn() method: Observable.just("First item", "Second item") .subscribeOn(Schedulers.io()) .doOnNext(e -> Log.d("APP", "on-next:" + Thread.currentThread().getName() + ":" + e)) .observeOn(AndroidSchedulers.mainThread()) .subscribe(e -> Log.d("APP", "subscribe:" + Thread.currentThread().getName() + ":" + e)); The output will be as illustrated: on-next:RxCachedThreadScheduler-1:First item on-next:RxCachedThreadScheduler-1:Second item subscribe:main:First item subscribe:main:Second item You will note that the items in the doOnNext block were executed on the "RxThread", and the subscribe block items were executed on the main thread. Investigating the Flow of Observable The logging inside the steps of an Observable is a very powerful tool when one wants to understand how they work. If you are in doubt at any point as to what's happening, add logging and experiment. A few quick iterations with logs will definitely help you understand what's going on under the hood. Let's use this technique to analyze a full flow of an Observable. We will start off with this script: private void log(String stage, String item) { Log.d("APP", stage + ":" + Thread.currentThread().getName() + ":" + item); } private void log(String stage) { Log.d("APP", stage + ":" + Thread.currentThread().getName()); } Observable.just("One", "Two") .subscribeOn(Schedulers.io()) .doOnDispose(() -> log("doOnDispose")) .doOnComplete(() -> log("doOnComplete")) .doOnNext(e -> log("doOnNext", e)) .doOnEach(e -> log("doOnEach")) .doOnSubscribe((e) -> log("doOnSubscribe")) .doOnTerminate(() -> log("doOnTerminate")) .doFinally(() -> log("doFinally")) .observeOn(AndroidSchedulers.mainThread()) .subscribe(e -> log("subscribe", e)); It can be seen that it has lots of additional and unfamiliar steps (more about this later). They represent different stages during the processing of an Observable. So, what's the output of the preceding script?: doOnSubscribe:main doOnNext:RxCachedThreadScheduler-1:One doOnEach:RxCachedThreadScheduler-1 doOnNext:RxCachedThreadScheduler-1:Two doOnEach:RxCachedThreadScheduler-1 doOnComplete:RxCachedThreadScheduler-1 doOnEach:RxCachedThreadScheduler-1 doOnTerminate:RxCachedThreadScheduler-1 doFinally:RxCachedThreadScheduler-1 subscribe:main:One subscribe:main:Two doOnDispose:main Let's go through some of the steps. 
First of all, by calling .subscribe() the doOnSubscribe block was executed. This started the emission of items from the Observable as we can see on the doOnNext and doOnEach lines. Finally, the stream finished at termination life cycle was activated--the doOnComplete, doOnTerminate and doOnFinally. Also, the reader will note that the doOnDispose block was called on the main thread along with the subscribe block. The flow will be a little different if .subscribeOn() and .observeOn() calls won't be there: doOnSubscribe:main doOnNext:main:One doOnEach:main subscribe:main:One doOnNext:main:Two doOnEach:main subscribe:main:Two doOnComplete:main doOnEach:main doOnTerminate:main doOnDispose:main doFinally:main You will readily note that now, the doFinally block was executed after doOnDispose while in the former setup, doOnDispose was the last. This happens due to the way Android Looper schedulers code blocks for execution and the fact that we used two different threads in the first case. The takeaway here is that whenever you are unsure of what is going on, start logging actions (and the thread they are running on) to see what's actually happening. Flowable Flowable can be regarded as a special type of Observable (but internally it isn't). It has almost the same method signature like the Observable as well. The difference is that Flowable allows you to process items that emitted faster from the source than some of the following steps can handle. It might sound confusing, so let's analyze an example. Assume that you have a source that can emit a million items per second. However, the next step uses those items to do a network request. We know, for sure, that we cannot do more than 50 requests per second: That poses a problem. What will we do after 60 seconds? There will be 60 million items in the queue waiting to be processed. The items are accumulating at a rate of 1 million items per second between the first and the second steps because the second step processes them at a much slower rate. Clearly, the problem here is that the available memory will be exhausted and the programming will fail with an OutOfMemory (OOM) exception. For example, this script will cause an excessive memory usage because the processing step just won't be able to keep up with the pace the items are emitted at. PublishSubject<Integer> observable = PublishSubject.create(); observable .observeOn(Schedulers.computation()) .subscribe(v -> log("s", v.toString()), this::log); for (int i = 0; i < 1000000; i++) { observable.onNext(i); } private void log(Throwable throwable) { Log.e("APP", "Error", throwable); } By converting this to a Flowable, we can start controlling this behavior: observable.toFlowable(BackpressureStrategy.MISSING) .observeOn(Schedulers.computation()) .subscribe(v -> log("s", v.toString()), this::log); Since we have chosen not to specify how we want to handle items that cannot be processed (it's called Backpressuring), it will throw a MissingBackpressureException. However, if the number of items was 100 instead of a million, it would have been just fine as it wouldn't hit the internal buffer of Flowable. By default, the size of the Flowable queue (buffer) is 128. There are a few Backpressure strategies that will define how the excessive amount of items should be handled. Drop Items Dropping means that if the downstream processing steps cannot keep up with the pace of the source Observable, just drop the data that cannot be handled. 
This can only be used in the cases when losing data is okay, and you care more about the values that were emitted in the beginning. There are a few ways in which items can be dropped. The first one is just to specify Backpressure strategy, like this: observable.toFlowable(BackpressureStrategy.DROP) Alternatively, it will be like this: observable.toFlowable(BackpressureStrategy.MISSING) .onBackpressureDrop() A similar way to do that would be to call .sample(). It will emit items only periodically, and it will take only the last value that's available (while BackpressureStrategy.DROP drops it instantly unless it is free to push it down the stream). All the other values between "ticks" will be dropped: observable.toFlowable(BackpressureStrategy.MISSING) .sample(10, TimeUnit.MILLISECONDS) .observeOn(Schedulers.computation()) .subscribe(v -> log("s", v.toString()), this::log); Preserve Latest Item Preserving the last items means that if the downstream cannot cope with the items that are being sent to them, stop emitting values and wait until they become available. While waiting, keep dropping all the values except the last one that arrived and when the downstream becomes available to send the last message that's currently stored. Like with Dropping, the "Latest" strategy can be specified while creating an Observable: observable.toFlowable(BackpressureStrategy.LATEST) Alternatively, by calling .onBackpressure(): observable.toFlowable(BackpressureStrategy.MISSING) .onBackpressureLatest() Finally, a method, .debounce(), can periodically take the last value at specific intervals: observable.toFlowable(BackpressureStrategy.MISSING) .debounce(10, TimeUnit.MILLISECONDS) Buffering It's usually a poor way to handle different paces of items being emitted and consumed as it often just delays the problem. However, this can work just fine if there is just a temporal slowdown in one of the consumers. In this case, the items emitted will be stored until later processing and when the slowdown is over, the consumers will catch up. If the consumers cannot catch up, at some point the buffer will run out and we can see a very similar behavior to the original Observable with memory running out. Enabling buffers is, again, pretty straightforward by calling the following: observable.toFlowable(BackpressureStrategy.BUFFER) or observable.toFlowable(BackpressureStrategy.MISSING) .onBackpressureBuffer() If there is a need to specify a particular value for the buffer, one can use .buffer(): observable.toFlowable(BackpressureStrategy.MISSING) .buffer(10) Completable, Single, and Maybe Types Besides the types of Observable and Flowable, there are three more types that RxJava provides: Completable: It represents an action without a result that will be completed in the future Single: It's just like Observable (or Flowable) that returns a single item instead of a stream Maybe: It stands for an action that can complete (or fail) without returning any value (like Completable) but can also return an item like Single However, all these are used quite rarely. Let's take a quick look at the examples. Completable Since Completable can basically process just two types of actions--onComplete and onError--we will cover it very briefly. Completable has many static factory methods available to create it but, most often, it will just be found as a return value in some other libraries. 
For example, the Completable can be created by calling the following: Completable completable = Completable.fromAction(() -> { log("Let's do something"); }); Then, it is to be subscribed with the following: completable.subscribe(() -> { log("Finished"); }, throwable -> { log(throwable); }); Single Single provides a way to represent an Observable that will return just a single item (thus the name). You might ask, why it is worth having it at all? These types are useful to tell the developers about the specific behavior that they should expect. To create a Single, one can use this example: Single.just("One item") The Single and the Subscription to it can be created with the following: Single.just("One item") .subscribe((item) -> { log(item); }, (throwable) -> { log(throwable); }); Make a note that this differs from Completable in that the first argument to the .subscribe() action now expects to receive an item as a result. Maybe Finally, the Maybe type is very similar to the Single type, but the item might not be returned to the subscriber in the end. The Maybe type can be created in a very similar fashion as before: Maybe.empty(); or like Maybe.just("Item"); However, the .subscribe() can be called with arguments dedicated to handling onSuccess (for received items), onError (to handle errors), and onComplete (to do a final action after the item is handled): Maybe.just("Item") .subscribe( s -> log("success: " + s), throwable -> log("error"), () -> log("onComplete") ); Summary In this article, we covered the most essentials parts of RxJava. Resources for Article: Further resources on this subject: The Art of Android Development Using Android Studio [article] Drawing and Drawables in Android Canvas [article] Optimizing Games for Android [article]


Scraping a Web Page

Packt
20 Jun 2017
11 min read
In this article by Katharine Jarmul, author of the book Python Web Scraping - Second Edition, we look at an example: suppose I have a shop selling shoes and want to keep track of my competitor's prices. I could go to my competitor's website each day and compare each shoe's price with my own; however, this will take a lot of time and will not scale well if I sell thousands of shoes or need to check price changes frequently. Or maybe I just want to buy a shoe when it's on sale. I could come back and check the shoe website each day until I get lucky, but the shoe I want might not be on sale for months. These repetitive manual processes could instead be replaced with an automated solution using the web scraping techniques covered in this book.

In an ideal world, web scraping wouldn't be necessary and each website would provide an API to share the data in a structured format. Indeed, some websites do provide APIs, but they typically restrict the data that is available and how frequently it can be accessed. Additionally, a website developer might change, remove, or restrict the backend API. In short, we cannot rely on APIs to access the online data we may want and therefore, we need to learn about web scraping techniques.

(For more resources related to this topic, see here.)

Three approaches to scrape a web page

Now that we understand the structure of this web page, we will investigate three different approaches to scraping its data: first with regular expressions, then with the popular BeautifulSoup module, and finally with the powerful lxml module.

Regular expressions

If you are unfamiliar with regular expressions or need a reminder, there is a thorough overview available at https://docs.python.org/3/howto/regex.html. Even if you use regular expressions (or regex) with another programming language, I recommend stepping through it for a refresher on regex with Python. To scrape the country area using regular expressions, we will first try matching the contents of the <td> element, as follows:

>>> import re
>>> from advanced_link_crawler import download
>>> url = 'http://example.webscraping.com/view/UnitedKingdom-239'
>>> html = download(url)
>>> re.findall(r'<td class="w2p_fw">(.*?)</td>', html)
['<img src="/places/static/images/flags/gb.png" />', '244,820 square kilometres',
'62,348,447', 'GB', 'United Kingdom', 'London', 'EU', '.uk', 'GBP', 'Pound', '44',
'@# #@@|@## #@@|@@# #@@|@@## #@@|@#@ #@@|@@#@ #@@|GIR0AA',
'^(([A-Z]\d{2}[A-Z]{2})|([A-Z]\d{3}[A-Z]{2})|([A-Z]{2}\d{2} [A-Z]{2})|([A-Z]{2}\d{3}[A-Z]{2})|([A-Z]\d[A-Z]\d[A-Z]{2})|([A-Z]{2}\d[A-Z]\d[A-Z]{2})|(GIR0AA))$',
'en-GB,cy-GB,gd', 'IE ']

This result shows that the <td class="w2p_fw"> tag is used for multiple country attributes. If we simply wanted to scrape the country area, we can select the second matching element, as follows:

>>> re.findall('<td class="w2p_fw">(.*?)</td>', html)[1]
'244,820 square kilometres'

This solution works but could easily fail if the web page is updated. Consider if this table is changed and the area is no longer in the second matching element. If we just need to scrape the data now, future changes can be ignored. However, if we want to rescrape this data at some point, we want our solution to be as robust against layout changes as possible. To make this regular expression more specific, we can include the parent <tr> element, which has an ID, so it ought to be unique:

>>> re.findall('<tr id="places_area__row">.*?<td class="w2p_fw">(.*?)</td>', html)
['244,820 square kilometres']

This iteration is better; however, there are many other ways the web page could be updated in a way that still breaks the regular expression.
For example, double quotation marks might be changed to single, extra spaces could be added between the tags, or the area_label could be changed. Here is an improved version to try and support these various possibilities:

>>> re.findall('''<tr id="places_area__row">.*?<td\s*class=["']w2p_fw["']>(.*?)</td>''', html)
['244,820 square kilometres']

This regular expression is more future-proof but is difficult to construct, and quite unreadable. Also, there are still plenty of other minor layout changes that would break it, such as if a title attribute was added to the <td> tag or if the tr or td elements changed their CSS classes or IDs. From this example, it is clear that regular expressions provide a quick way to scrape data but are too brittle and easily break when a web page is updated. Fortunately, there are better data extraction solutions, such as Beautiful Soup.

Beautiful Soup

Beautiful Soup is a popular library that parses a web page and provides a convenient interface to navigate content. If you do not already have this module, the latest version can be installed using this command:

pip install beautifulsoup4

The first step with Beautiful Soup is to parse the downloaded HTML into a soup document. Many web pages do not contain perfectly valid HTML and Beautiful Soup needs to correct improper open and close tags. For example, consider this simple web page containing a list with missing attribute quotes and closing tags:

<ul class=country>
 <li>Area
 <li>Population
</ul>

If the Population item is interpreted as a child of the Area item instead of the list, we could get unexpected results when scraping. Let us see how Beautiful Soup handles this:

>>> from bs4 import BeautifulSoup
>>> broken_html = '<ul class=country><li>Area<li>Population</ul>'
>>> # parse the HTML
>>> soup = BeautifulSoup(broken_html, 'html.parser')
>>> fixed_html = soup.prettify()
>>> print(fixed_html)
<ul class="country">
 <li>
  Area
  <li>
   Population
  </li>
 </li>
</ul>

We can see that using the default html.parser did not result in properly parsed HTML. We can see from the previous snippet that it has used nested li elements, which might make it difficult to navigate. Luckily, there are more options for parsers. We can install LXML or we can also use html5lib. To install html5lib, simply use pip:

pip install html5lib

Now, we can repeat this code, changing only the parser like so:

>>> soup = BeautifulSoup(broken_html, 'html5lib')
>>> fixed_html = soup.prettify()
>>> print(fixed_html)
<html>
 <head>
 </head>
 <body>
  <ul class="country">
   <li>
    Area
   </li>
   <li>
    Population
   </li>
  </ul>
 </body>
</html>

Here, BeautifulSoup using html5lib was able to correctly interpret the missing attribute quotes and closing tags, as well as add the <html> and <body> tags to form a complete HTML document. You should see similar results if you used lxml. Now, we can navigate to the elements we want using the find() and find_all() methods:

>>> ul = soup.find('ul', attrs={'class':'country'})
>>> ul.find('li') # returns just the first match
<li>Area</li>
>>> ul.find_all('li') # returns all matches
[<li>Area</li>, <li>Population</li>]

For a full list of available methods and parameters, the official documentation is available at http://www.crummy.com/software/BeautifulSoup/bs4/doc/.
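As a quick aside, recent versions of Beautiful Soup can also be queried with CSS selectors through the select() and select_one() methods, which some readers may find more familiar than find() and find_all(). A minimal sketch, reusing the soup object parsed with html5lib above:

>>> # 'ul.country > li' selects <li> elements that are direct children of a <ul> with class "country"
>>> soup.select('ul.country > li')
[<li>Area</li>, <li>Population</li>]
>>> soup.select_one('ul.country > li')
<li>Area</li>

The examples that follow stick to find() and find_all(), and we will return to CSS selectors when we look at lxml below.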
Now, using these techniques, here is a full example to extract the country area from our example website: >>> from bs4 import BeautifulSoup >>> url = 'http://example.webscraping.com/places/view/United-Kingdom-239' >>> html = download(url) >>> soup = BeautifulSoup(html) >>> # locate the area row >>> tr = soup.find(attrs={'id':'places_area__row'}) >>> td = tr.find(attrs={'class':'w2p_fw'}) # locate the data element >>> area = td.text # extract the text from the data element >>> print(area) 244,820 square kilometres This code is more verbose than regular expressions but easier to construct and understand. Also, we no longer need to worry about problems in minor layout changes, such as extra whitespace or tag attributes. We also know if the page contains broken HTML that BeautifulSoup can help clean the page and allow us to extract data from very broken website code. Lxml Lxml is a Python library built on top of the libxml2 XML parsing library written in C, which helps make it faster than Beautiful Soup but also harder to install on some computers, specifically Windows. The latest installation instructions are available at http://lxml.de/installation.html. If you run into difficulties installing the library on your own, you can also use Anaconda to do so:  https://anaconda.org/anaconda/lxml. If you are unfamiliar with Anaconda, it is a package and environment manager primarily focused on open data science packages built by the folks at Continuum Analytics. You can download and install Anaconda by following their setup instructions here: https://www.continuum.io/downloads. Note that using the Anaconda quick install will set your PYTHON_PATH to the Conda installation of Python. As with Beautiful Soup, the first step when using lxml is parsing the potentially invalid HTML into a consistent format. Here is an example of parsing the same broken HTML: >>> from lxml.html import fromstring, tostring >>> broken_html = '<ul class=country><li>Area<li>Population</ul>' >>> tree = fromstring(broken_html) # parse the HTML >>> fixed_html = tostring(tree, pretty_print=True) >>> print(fixed_html) <ul class="country"> <li>Area</li> <li>Population</li> </ul> As with BeautifulSoup, lxml was able to correctly parse the missing attribute quotes and closing tags, although it did not add the <html> and <body> tags. These are not requirements for standard XML and so are unnecessary for lxml to insert. After parsing the input, lxml has a number of different options to select elements, such as XPath selectors and a find() method similar to Beautiful Soup. Instead, we will use CSS selectors here, because they are more compact and can be reused later when parsing dynamic content. Some readers will already be familiar with them from their experience with jQuery selectors or use in front-end web application development. We will compare performance of these selectors with XPath. To use CSS selectors, you might need to install the cssselect library like so: pip install cssselect Now we can use the lxml CSS selectors to extract the area data from the example page: >>> tree = fromstring(html) >>> td = tree.cssselect('tr#places_area__row > td.w2p_fw')[0] >>> area = td.text_content() >>> print(area) 244,820 square kilometres By using the cssselect method on our tree, we can utilize CSS syntax to select a table row element with the places_area__row ID, and then the child table data tag with the w2p_fw class. 
Since cssselect returns a list, we then index the first result and call the text_content method, which will iterate over all child elements and return concatenated text of each element. In this case, we only have one element, but this functionality is useful to know for more complex extraction examples. Summary We have walked through a variety of ways to scrape data from a web page. Regular expressions can be useful for a one-off scrape or to avoid the overhead of parsing the entire web page, and BeautifulSoup provides a high-level interface while avoiding any difficult dependencies. However, in general, lxml will be the best choice because of its speed and extensive functionality, so we will use it in future examples. Resources for Article: Further resources on this subject: Web scraping with Python (Part 2) [article] Scraping the Web with Python - Quick Start [article] Scraping the Data [article]


Manipulating functions in functional programming

Packt
20 Jun 2017
6 min read
In this article by Wisnu Anggoro, author of the book Learning C++ Functional Programming, you will learn to apply functional programming techniques to C++ to build highly modular, testable, and reusable code. In this article, you will learn the following topics: Applying a first-class function in all functions Passing a function as other functions parameter Assigning a function to a variable Storing a function in the container (For more resources related to this topic, see here.) Applying a first-class function in all functions There's nothing special with the first-class function since it's a normal class. We can treat the first-class function like any other data type. However, in the language that supports the first-class function, we can perform the following tasks without invoking the compiler recursively: Passing a function as other function parameters Assigning functions to a variable Storing functions in collections Fortunately, C++ can be used to solve the preceding tasks. We will discuss it in depth in the following topics. Passing a function as other functions parameter Let's start passing a function as the function parameter. We will choose one of four functions and invoke the function from its main function. The code will look as follows: /* first-class-1.cpp */ #include <functional> #include <iostream> using namespace std; typedef function<int(int, int)> FuncType; int addition(int x, int y) { return x + y; } int subtraction(int x, int y) { return x - y; } int multiplication(int x, int y) { return x * y; } int division(int x, int y) { return x / y; } void PassingFunc(FuncType fn, int x, int y) { cout << "Result = " << fn(x, y) << endl; } int main() { int i, a, b; FuncType func; cout << "Select mode:" << endl; cout << "1. Addition" << endl; cout << "2. Subtraction" << endl; cout << "3. Multiplication" << endl; cout << "4. Division" << endl; cout << "Choice: "; cin >> i; cout << "a = "; cin >> a; cout << "b = "; cin >> b; switch(i) { case 1: PassingFunc(addition, a, b); break; case 2: PassingFunc(subtraction, a, b); break; case 3: PassingFunc(multiplication, a, b); break; case 4: PassingFunc(division, a, b); break; } return 0; } From the preceding code, we can see that we have four functions, and we want the user to choose one from them, and then run it. In the switch statement, we will invoke one of the four functions based on the choice of the user. We will pass the selected function to PassingFunc(), as we can see in the following code snippet: case 1: PassingFunc(addition, a, b); break; case 2: PassingFunc(subtraction, a, b); break; case 3: PassingFunc(multiplication, a, b); break; case 4: PassingFunc(division, a, b); break; The result we get on the screen should look like the following screenshot: The preceding screenshot shows that we selected the Subtraction mode and gave 8 to a and 3 to b. As we expected, the code gives us 5 as a result. Assigning a function to variable We can also assign a function to the variable so that we can call the function by calling the variable. We will refactor first-class-1.cpp, and it will be as follows: /* first-class-2.cpp */ #include <functional> #include <iostream> using namespace std; // Adding the addition, subtraction, multiplication, and // division function as we've got in first-class-1.cpp int main() { int i, a, b; function<int(int, int)> func; cout << "Select mode:" << endl; cout << "1. Addition" << endl; cout << "2. Subtraction" << endl; cout << "3. Multiplication" << endl; cout << "4. 
Division" << endl; cout << "Choice: "; cin >> i; cout << "a = "; cin >> a; cout << "b = "; cin >> b; switch(i) { case 1: func = addition; break; case 2: func = subtraction; break; case 3: func = multiplication; break; case 4: func = division; break; } cout << "Result = " << func(a, b) << endl; return 0; } We will now assign the four functions based on the user choice. We will store the selected function in func variable inside the switch statement, as follows: case 1: func = addition; break; case 2: func = subtraction; break; case 3: func = multiplication; break; case 4: func = division; break; After the func variable is assigned with the user's choice, the code will just call the variable like calling the function as follows: cout << "Result = " << func(a, b) << endl; Moreover, we will obtain the same output on the console if we run the code. Storing a function in the container Now, let's save the function to the container. Here, we will use the vector as the container. The code is as follows: /* first-class-3.cpp */ #include <vector> #include <functional> #include <iostream> using namespace std; typedef function<int(int, int)> FuncType; // Adding the addition, subtraction, multiplication, and // division function as we've got in first-class-1.cpp int main() { vector<FuncType> functions; functions.push_back(addition); functions.push_back(subtraction); functions.push_back(multiplication); functions.push_back(division); int i, a, b; function<int(int, int)> func; cout << "Select mode:" << endl; cout << "1. Addition" << endl; cout << "2. Subtraction" << endl; cout << "3. Multiplication" << endl; cout << "4. Division" << endl; cout << "Choice: "; cin >> i; cout << "a = "; cin >> a; cout << "b = "; cin >> b; cout << "Result = " << functions.at(i - 1)(a, b) << endl; return 0; } From the preceding code, we can see that we created a new vector named functions, then stored four different functions to it. Same with our two previous code samples, we ask the user to select the mode as well. However, now the code becomes simpler since we don't need to add the switch statement as we can select the function directly by selecting the vector index, as we can see in the following line of code: cout << "Result = " << functions.at(i - 1)(a, b) << endl; However, since the vector is a zero-based index, we have to adjust the index with the menu choice. The result will be the same with our two previous code samples. Summary In this article, we discussed that there are some techniques to manipulate a function to produce the various purpose on it. Since we can implement the first-class function in C++ language, we can pass a function as other functions parameter. We can treat a function as a data object so we can assign it to a variable and store it in the container. Resources for Article: Further resources on this subject: Introduction to the Functional Programming [article] Functional Programming in C# [article] Putting the Function in Functional Programming [article]

What are Microservices?

Packt
20 Jun 2017
12 min read
In this article written by Gaurav Kumar Aroraa, Lalit Kale, Kanwar Manish, authors of the book Building Microservices with .NET Core, we will start with a brief introduction. Then, we will define its predecessors: monolithic architecture and service-oriented architecture (SOA). After this, we will see how microservices fare against both SOA and the monolithic architecture. We will then compare the advantages and disadvantages of each one of these architectural styles. This will enable us to identify the right scenario for these styles. We will understand the problems that arise from having a layered monolithic architecture. We will discuss the solutions available to these problems in the monolithic world. At the end, we will be able to break down a monolithic application into a microservice architecture. We will cover the following topics in this article: Origin of microservices Discussing microservices (For more resources related to this topic, see here.) Origin of microservices The term microservices was used for the first time in mid-2011 at a workshop of software architects. In March 2012, James Lewis presented some of his ideas about microservices. By the end of 2013, various groups from the IT industry started having discussions on microservices, and by 2014, it had become popular enough to be considered a serious contender for large enterprises. There is no official introduction available for microservices. The understanding of the term is purely based on the use cases and discussions held in the past. We will discuss this in detail, but before that, let's check out the definition of microservices as per Wikipedia (https://en.wikipedia.org/wiki/Microservices), which sums it up as: Microservices is a specialization of and implementation approach for SOA used to build flexible, independently deployable software systems. In 2014, James Lewis and Martin Fowler came together and provided a few real-world examples and presented microservices (refer to http://martinfowler.com/microservices/) in their own words and further detailed it as follows: The microservice architectural style is an approach to developing a single application as a suite of small services, each running in its own process and communicating with lightweight mechanisms, often an HTTP resource API. These services are built around business capabilities and independently deployable by fully automated deployment machinery. There is a bare minimum of centralized management of these services, which may be written in different programming languages and use different data storage technologies. It is very important that you see all the attributes James and Martin defined here. They defined it as an architectural style that developers could utilize to develop a single application with the business logic spread across a bunch of small services, each having their own persistent storage functionality. Also, note its attributes: it can be independently deployable, can run in its own process, is a lightweight communication mechanism, and can be written in different programming languages. We want to emphasize this specific definition since it is the crux of the whole concept. And as we move along, it will come together by the time we finish this book. Discussing microservices Until now, we have gone through a few definitions of microservices; now, let's discuss microservices in detail. In short, a microservice architecture removes most of the drawbacks of SOA architectures.  
Slicing your application into a number of services is neither SOA nor microservices. However, combining service design and best practices from the SOA world along with a few emerging practices, such as isolated deployment, semantic versioning, providing lightweight services, and service discovery in polyglot programming, is microservices. We implement microservices to satisfy business features and implement them with reduced time to market and greater flexibility. Before we move on to understand the architecture, let's discuss the two important architectures that have led to its existence: The monolithic architecture style SOA Most of us would be aware of the scenario where during the life cycle of an enterprise application development, a suitable architectural style is decided. Then, at various stages, the initial pattern is further improved and adapted with changes that cater to various challenges, such as deployment complexity, large code base, and scalability issues. This is exactly how the monolithic architecture style evolved into SOA, further leading up to microservices. Monolithic architecture The monolithic architectural style is a traditional architecture type and has been widely used in the industry. The term "monolithic" is not new and is borrowed from the Unix world. In Unix, most of the commands exist as a standalone program whose functionality is not dependent on any other program. As seen in the succeeding image, we can have different components in the application such as: User interface: This handles all of the user interaction while responding with HTML or JSON or any other preferred data interchange format (in the case of web services). Business logic: All the business rules applied to the input being received in the form of user input, events, and database exist here. Database access: This houses the complete functionality for accessing the database for the purpose of querying and persisting objects. A widely accepted rule is that it is utilized through business modules and never directly through user-facing components. Software built using this architecture is self-contained. We can imagine a single .NET assembly that contains various components, as described in the following image: As the software is self-contained here, its components are interconnected and interdependent. Even a simple code change in one of the modules may break a major functionality in other modules. This would result in a scenario where we'd need to test the whole application. With the business depending critically on its enterprise application frameworks, this amount of time could prove to be very critical. Having all the components tightly coupled poses another challenge: whenever we execute or compile such software, all the components should be available or the build will fail; refer to the preceding image that represents a monolithic architecture and is a self-contained or a single .NET assembly project. However, monolithic architectures might also have multiple assemblies. This means that even though a business layer (assembly, data access layer assembly, and so on) is separated, at run time, all of them will come together and run as one process.  A user interface depends on other components' direct sale and inventory in a manner similar to all other components that depend upon each other. In this scenario, we will not be able to execute this project in the absence of any one of these components. 
The process of upgrading any one of these components will be more complex as we may have to consider other components that require code changes too. This results in more development time than required for the actual change. Deploying such an application will become another challenge. During deployment, we will have to make sure that each and every component is deployed properly; otherwise, we may end up facing a lot of issues in our production environments. If we develop an application using the monolithic architecture style, as discussed previously, we might face the following challenges: Large code base: This is a scenario where the code lines outnumber the comments by a great margin. As components are interconnected, we will have to bear with a repetitive code base. Too many business modules: This is in regard to modules within the same system. Code base complexity: This results in a higher chance of code breaking due to the fix required in other modules or services. Complex code deployment: You may come across minor changes that would require whole system deployment. One module failure affecting the whole system: This is in regard to modules that depend on each other. Scalability: This is required for the entire system and not just the modules in it. Intermodule dependency: This is due to tight coupling. Spiraling development time: This is due to code complexity and interdependency. Inability to easily adapt to a new technology: In this case, the entire system would need to be upgraded. As discussed earlier, if we want to reduce development time, ease of deployment, and improve maintainability of software for enterprise applications, we should avoid the traditional or monolithic architecture. Service-oriented architecture In the previous section, we discussed the monolithic architecture and its limitations. We also discussed why it does not fit into our enterprise application requirements. To overcome these issues, we should go with some modular approach where we can separate the components such that they should come out of the self-contained or single .NET assembly. The main difference between SOA & monolithic is not one or multiple assembly. But as the service in SOA runs as separate process, SOA scales better compared to monolithic. Let's discuss the modular architecture, that is, SOA. This is a famous architectural style using which the enterprise applications are designed with a collection of services as its base. These services may be RESTful or ASMX Web services. To understand SOA in more detail, let's discuss "service" first. What is service? Service, in this case, is an essential concept of SOA. It can be a piece of code, program, or software that provides some functionality to other system components. This piece of code can interact directly with the database or indirectly through another service. Furthermore, it can be consumed by clients directly, where the client may either be a website, desktop app, mobile app, or any other device app. Refer to the following diagram: Service refers to a type of functionality exposed for consumption by other systems (generally referred to as clients/client applications). As mentioned earlier, it can be represented by a piece of code, program, or software. Such services are exposed over the HTTP transport protocol as a general practice. However, the HTTP protocol is not a limiting factor, and a protocol can be picked as deemed fit for the scenario. 
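To make the idea of an HTTP-exposed service more concrete, here is a minimal sketch of what such a service could look like in ASP.NET Core, the stack used in this book. The controller name, route, namespace, and sample data below are purely illustrative and are not taken from the book's sample application:

using Microsoft.AspNetCore.Mvc;

namespace DirectSelling.Api
{
    // A minimal HTTP-exposed service: web, desktop, and mobile clients call this
    // endpoint, and the controller delegates to a data store or to another service.
    [Route("api/[controller]")]
    public class ProductsController : Controller
    {
        // GET api/products/42 - returns a single product by its identifier
        [HttpGet("{id}")]
        public IActionResult Get(int id)
        {
            // In a real service this would query a repository or database.
            var product = new { Id = id, Name = "Sample shoe", Price = 49.99m };
            return Ok(product);
        }
    }
}

Because the contract is just HTTP and JSON, any client can consume this endpoint without knowing how it is implemented, which is exactly the kind of reuse discussed next.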
In the following image, Service – direct selling is directly interacting with Database, and three different clients, namely Web, Desktop, and Mobile, are consuming the service. On the other hand, we have clients consuming Service – partner selling, which is interacting with Service – channel partners for database access. A product selling service is a set of services that interacts with client applications and provides database access directly or through another service, in this case, Service – Channel partner.  In the case of Service – direct selling, shown in the preceding example, it is providing some functionality to a Web Store, a desktop application, and a mobile application. This service is further interacting with the database for various tasks, namely fetching data, persisting data, and so on. Normally, services interact with other systems via some communication channel, generally the HTTP protocol. These services may or may not be deployed on the same or single servers. In the preceding image, we have projected an SOA example scenario. There are many fine points to note here, so let's get started. Firstly, our services can be spread across different physical machines. Here, Service-direct selling is hosted on two separate machines. It is a possible scenario that instead of the entire business functionality, only a part of it will reside on Server 1 and the remaining on Server 2. Similarly, Service – partner selling appears to be having the same arrangement on Server 3 and Server 4. However, it doesn't stop Service – channel partners being hosted as a complete set on both the servers: Server 5 and Server 6. A system that uses a service or multiple services in a fashion mentioned in the preceding figure is called an SOA. We will discuss SOA in detail in the following sections. Let's recall the monolithic architecture. In this case, we did not use it because it restricts code reusability; it is a self-contained assembly, and all the components are interconnected and interdependent. For deployment, in this case, we will have to deploy our complete project after we select the SOA (refer to preceding image and subsequent discussion). Now, because of the use of this architectural style, we have the benefit of code reusability and easy deployment. Let's examine this in the wake of the preceding figure: Reusability: Multiple clients can consume the service. The service can also be simultaneously consumed by other services. For example, OrderService is consumed by web and mobile clients. Now, OrderService can also be used by the Reporting Dashboard UI. Stateless: Services do not persist any state between requests from the client, that is, the service doesn't know, nor care, that the subsequent request has come from the client that has/hasn't made the previous request. Contract-based: Interfaces make it technology-agnostic on both sides of implementation and consumption. It also serves to make it immune to the code updates in the underlying functionality. Scalability: A system can be scaled up; SOA can be individually clustered with appropriate load balancing. Upgradation: It is very easy to roll out new functionalities or introduce new versions of the existing functionality. The system doesn't stop you from keeping multiple versions of the same business functionality. Summary In this article, we discussed what the microservice architectural style is in detail, its history, and how it differs from its predecessors: monolithic and SOA. 
We further defined the various challenges that monolithic faces when dealing with large systems. Scalability and reusability are some definite advantages that SOA provides over monolithic. We also discussed the limitations of the monolithic architecture, including scaling problems, by implementing a real-life monolithic application. The microservice architecture style resolves all these issues by reducing code interdependency and isolating the dataset size that any one of the microservices works upon. We utilized dependency injection and database refactoring for this. We further explored automation, CI, and deployment. These easily allow the development team to let the business sponsor choose what industry trends to respond to first. This results in cost benefits, better business response, timely technology adoption, effective scaling, and removal of human dependency. Resources for Article: Further resources on this subject: Microservices and Service Oriented Architecture [article] Breaking into Microservices Architecture [article] Microservices – Brave New World [article]


Basics of Python for Absolute Beginners

Packt
19 Jun 2017
5 min read
In this article by Bhaskar Das and Mohit Raj, authors of the book, Learn Python in 7 days, we will learn basics of Python. The Python language had a humble beginning in the late 1980s when a Dutchman, Guido Von Rossum, started working on a fun project that would be a successor to the ABC language with better exception handling and capability to interface with OS Amoeba at Centrum Wiskunde and Informatica. It first appeared in 1991. Python 2.0 was released in the year 2000 and Python 3.0 was released in the year 2008. The language was named Python after the famous British television comedy show Monty Python's Flying Circus, which was one of the favorite television programmes of Guido. Here, we will see why Python has suddenly influenced our lives, various applications that use Python, and Python's implementations. In this article, you will be learning the basic installation steps required to perform on different platforms (that is Windows, Linux and Mac), about environment variables, setting up environment variables, file formats, Python interactive shell, basic syntaxes, and, finally, printing out the formatted output. (For more resources related to this topic, see here.) Why Python? Now you might be suddenly bogged with the question, why Python? According to the Institute of Electrical and Electronics Engineers (IEEE) 2016 ranking, Python ranked third after C and Java. As per Indeed.com's data of 2016, Python job market search ranked fifth. Clearly, all the data points to the ever-rising demand in the job market for Python. It's a cool language if you want to learn it just for fun. Also, you will adore the language if you want to build your career around Python. At the school level, many schools have started including Python programming for kids. With new technologies taking the market by surprise, Python has been playing a dominant role. Whether it's cloud platform, mobile app development, BigData, IoT with Raspberry Pi, or the new Blockchain technology, Python is being seen as a niche language platform to develop and deliver scalable and robust applications. Some key features of the language are: Python programs can run on any platform, you can carry code created in a Windows machine and run it on Mac or Linux Python has a large inbuilt library with prebuilt and portable functionality, known as the standard library Python is an expressive language Python is free and open source Python code is about one third of the size of equivalent C++ and Java code. Python can be both dynamically and strongly typed In dynamically typed, the type of a variable is interpreted at runtime, which means that there is no need to define the type (int, float) of a variable in Python Python applications One of the most famous platform where Python is extensively used is YouTube. Other places where you will find Python being extensively used are special effects in Hollywood movies, drug evolution and discovery, traffic control systems, ERP systems, cloud hosting, e-commerce platform, CRM systems, and whichever field you can think of. Versions At the time of writing this book, the two main versions of the Python programming language available in the market were Python 2.x and Python 3.x. The stable releases at the time of writing this book were Python 2.7.13 and Python 3.6.0. Implementations of Python Major implementations include CPython, Jython, IronPython, MicroPython and PyPy. 
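Before we move on to installation, here is a tiny throwaway snippet illustrating the dynamic typing mentioned above: we never declare a type, and the same name can refer to values of different types at different times:

# No type declarations: the interpreter determines the type at runtime
value = 42
print(type(value))    # <type 'int'> on Python 2, <class 'int'> on Python 3

value = "forty two"   # the same name now refers to a string
print(type(value))    # <type 'str'> on Python 2, <class 'str'> on Python 3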
Installation

Here, we will walk through the installation of Python on three different OS platforms, namely Windows, Linux, and Mac OS. Let's begin with the Windows platform.

Installation on Windows platform

Python 2.x can be downloaded from https://www.python.org/downloads. The installer is simple and easy to install. Follow these steps to install the setup:

1. Once you click on the setup installer, you will get a small window on your desktop screen as shown. Click on Next.
2. Provide a suitable installation folder to install Python. If you don't provide an installation folder, the installer will automatically create one for you, as shown in the screenshot. Click on Next.
3. After the completion of Step 2, you will get a window to customize Python, as shown in the following screenshot. Note that the Add python.exe to Path option has been marked. Select this option to add Python to the system path variable. Click on Next.
4. Finally, click Finish to complete the installation.

Summary

So far, we did a walkthrough of the beginning and brief history of Python. We looked at the various implementations and flavors of Python. You also learned about installing Python on Windows OS. Hope this article has incited enough interest in Python and serves as your first step in the kingdom of Python, with enormous possibilities!

Resources for Article: Further resources on this subject: Layout Management for Python GUI [article] Putting the Fun in Functional Python [article] Basics of Jupyter Notebook and Python [article]


Understanding the Basics of Gulp

Packt
19 Jun 2017
15 min read
In this article written by Travis Maynard, author of the book Getting Started with Gulp - Second Edition, we will take a look at the basics of gulp and how it works. Understanding some of the basic principles and philosophies behind the tool, it's plugin system will assist you as you begin writing your own gulpfiles. We'll start by taking a look at the engine behind gulp and then follow up by breaking down the inner workings of gulp itself. By the end of this article, you will be prepared to begin writing your own gulpfile. (For more resources related to this topic, see here.) Installing node.js and npm As you learned in the introduction, node.js and npm are the engines that work behind the scenes that allow us to operate gulp and keep track of any plugins we decide to use. Downloading and installing node.js For Mac and Windows, the installation is quite simple. All you need to do is navigate over to http://nodejs.org and click on the big green install button. Once the installer has finished downloading, run the application and it will install both node.js and npm. For Linux, there are a couple more steps, but don't worry; with your newly acquired command-line skills, it should be relatively simple. To install node.js and npm on Linux, you'll need to run the following three commands in Terminal: sudo add-apt-repository ppa:chris-lea/node.js sudo apt-get update sudo apt-get install nodejs The details of these commands are outside the scope of this book, but just for reference, they add a repository to the list of available packages, update the total list of packages, and then install the application from the repository we added. Verify the installation To confirm that our installation was successful, try the following command in your command line: node -v If node.js is successfully installed, node -v will output a version number on the next line of your command line. Now, let's do the same with npm: npm -v Like before, if your installation was successful, npm -v should output the version number of npm on the next line. The versions displayed in this screenshot reflect the latest Long Term Support (LTS) release currently available as of this writing. This may differ from the version that you have installed depending on when you're reading this. It's always suggested that you use the latest LTS release when possible. The -v  command is a common flag used by most command-line applications to quickly display their version number. This is very useful to debug version issues while using command-line applications. Creating a package.json file Having npm in our workflow will make installing packages incredibly easy; however, we should look ahead and establish a way to keep track of all the packages (or dependencies) that we use in our projects. Keeping track of dependencies is very important to keep your workflow consistent across development environments. Node.js uses a file named package.json to store information about your project, and npm uses this same file to manage all of the package dependencies your project requires to run properly. In any project using gulp, it is always a great practice to create this file ahead of time so that you can easily populate your dependency list as you are installing packages or plugins. 
To create the package.json file, we will need to run npm's built in init action using the following command: npm init Now, using the preceding command, the terminal will show the following output: Your command line will prompt you several times asking for basic information about the project, such as the project name, author, and the version number. You can accept the defaults for these fields by simply pressing the Enter key at each prompt. Most of this information is used primarily on the npm website if a developer decides to publish a node.js package. For our purposes, we will just use it to initialize the file so that we can properly add our dependencies as we move forward. The screenshot for the preceding command is as follows: Installing gulp With npm installed and our package.json file created, we are now ready to begin installing node.js packages. The first and most important package we will install is none other than gulp itself. Locating gulp Locating and gathering information about node.js packages is very simple, thanks to the npm registry. The npm registry is a companion website that keeps track of all the published node.js modules, including gulp and gulp plugins. You can find this registry at http://npmjs.org. Take a moment to visit the npm registry and do a quick search for gulp. The listing page for each node.js module will give you detailed information on each project, including the author, version number, and dependencies. Additionally, it also features a small snippet of command-line code that you can use to install the package along with readme information that will outline basic usage of the package and other useful information. Installing gulp locally Before we install gulp, make sure you are in your project's root directory, gulp-book, using the cd and ls commands you learned earlier. If you ever need to brush up on any of the standard commands, feel free to take a moment to step back and review as we progress through the book. To install packages with npm, we will follow a similar pattern to the ones we've used previously. Since we will be covering both versions 3.x and 4.x in this book, we'll demonstrate installing both: For installing gulp 3.x, you can use the following: npm install --save-dev gulp For installing gulp 4.x, you can use the following: npm install --save-dev gulpjs/gulp#4.0 This command is quite different from the 3.x command because this command is installing the latest development release directly from GitHub. Since the 4.x version is still being actively developed, this is the only way to install it at the time of writing this book. Once released, you will be able to run the previous command to without installing from GitHub. Upon executing the command, it will result in output similar to the following: To break this down, let's examine each piece of this command to better understand how npm works: npm: This is the application we are running install: This is the action that we want the program to run. In this case, we are instructing npm to install something in our local folder --save-dev: This is a flag that tells npm to add this module to the dev dependencies list in our package.json file gulp: This is the package we would like to install Additionally, npm has a –-save flag that saves the module to the list of dependencies instead of devDependencies. These dependency lists are used to separate the modules that a project depends on to run, and the modules a project depends on when in development. 
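To see what the --save-dev flag actually records, it helps to peek inside package.json after the install finishes. The snippet below is only an illustration; the version number will be whatever npm resolves for you at install time:

{
  "name": "gulp-book",
  "version": "1.0.0",
  "devDependencies": {
    "gulp": "^3.9.1"
  }
}

Had we used --save instead, the same entry would have been written to a dependencies section, marking the package as something the project needs at run time rather than only during development.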
Since we are using gulp to assist us in development, we will always use the --save-dev flag throughout the book. So, this command will use npm to contact the npm registry, and it will install gulp to our local gulp-book directory. After using this command, you will note that a new folder has been created that is named node_modules. It is where node.js and npm store all of the installed packages and dependencies of your project. Take a look at the following screenshot: Installing gulp-cli globally For many of the packages that we install, this will be all that is needed. With gulp, we must install a companion module gulp-cli globally so that we can use the gulp command from anywhere in our filesystem. To install gulp-cli globally, use the following command: npm install -g gulp-cli In this command, not much has changed compared to the original command where we installed the gulp package locally. We've only added a -g flag to the command, which instructs npm to install the package globally. On Windows, your console window should be opened under an administrator account in order to install an npm package globally. At first, this can be a little confusing, and for many packages it won't apply. Similar build systems actually separate these usages into two different packages that must be installed separately; once that is installed globally for command-line use and another installed locally in your project. Gulp was created so that both of these usages could be combined into a single package, and, based on where you install it, it could operate in different ways. Anatomy of a gulpfile Before we can begin writing tasks, we should take a look at the anatomy and structure of a gulpfile. Examining the code of a gulpfile will allow us to better understand what is happening as we run our tasks. Gulp started with four main methods:.task(), .src(), .watch(), and .dest(). The release of version 4.x introduced additional methods such as: .series() and .parallel(). In addition to the gulp API methods, each task will also make use of the node.js .pipe() method. This small list of methods is all that is needed to understand how to begin writing basic tasks. They each represent a specific purpose and will act as the building blocks of our gulpfile. The task() method The .task() method is the basic wrapper for which we create our tasks. Its syntax is .task(string, function). It takes two arguments—string value representing the name of the task and a function that will contain the code you wish to execute upon running that task. The src() method The .src() method is our input or how we gain access to the source files that we plan on modifying. It accepts either a single glob string or an array of glob strings as an argument. Globs are a pattern that we can use to make our paths more dynamic. When using globs, we can match an entire set of files with a single string using wildcard characters as opposed to listing them all separately. The syntax is for this method is .src(string || array).  The watch() method The .watch() method is used to specifically look for changes in our files. This will allow us to keep gulp running as we code so that we don't need to rerun gulp any time we need to process our tasks. This syntax is different between the 3.x and 4.x version. For version 3.x the syntax is—.watch(string || array, array) with the first argument being our paths/globs to watch and the second argument being the array of task names that need to be run when those files change. 
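For instance, with gulp 3.x, a watch call that re-runs our scripts task whenever a JavaScript file changes might look like the following sketch (the glob and task name simply mirror the examples used later in this article):

// gulp 3.x style: the second argument is an array of task names to run on change
gulp.watch('js/*.js', ['scripts']);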
For version 4.x the syntax has changed a bit to allow for two new methods that provide more explicit control of the order in which tasks are executed. When using 4.x instead of passing in an array as the second argument, we will use either the .series() or .parallel() method like so—.watch(string || array, gulp.series() || gulp.parallel()). The dest() method The dest() method is used to set the output destination of your processed file. Most often, this will be used to output our data into a build or dist folder that will be either shared as a library or accessed by your application. The syntax for this method is .dest(string). The pipe() method The .pipe() method will allow us to pipe together smaller single-purpose plugins or applications into a pipechain. This is what gives us full control of the order in which we would need to process our files. The syntax for this method is .pipe(function). The parallel() and series() methods The parallel and series methods were added in version 4.x as a way to easily control whether your tasks are run together all at once or in a sequence one after the other. This is important if one of your tasks requires that other tasks complete before it can be ran successfully. When using these methods the arguments will be the string names of your tasks separated by a comma. The syntax for these methods is—.series(tasks) and .parallel(tasks); Understanding these methods will take you far, as these are the core elements of building your tasks. Next, we will need to put these methods together and explain how they all interact with one another to create a gulp task. Including modules/plugins When writing a gulpfile, you will always start by including the modules or plugins you are going to use in your tasks. These can be both gulp plugins or node.js modules, based on what your needs are. Gulp plugins are small node.js applications built for use inside of gulp to provide a single-purpose action that can be chained together to create complex operations for your data. Node.js modules serve a broader purpose and can be used with gulp or independently. Next, we can open our gulpfile.js file and add the following code: // Load Node Modules/Plugins var gulp = require('gulp'); var concat = require('gulp-concat'); var uglify = require('gulp-uglify'); The gulpfile.js file will look as shown in the following screenshot: In this code, we have included gulp and two gulp plugins: gulp-concat and gulp-uglify. As you can now see, including a plugin into your gulpfile is quite easy. After we install each module or plugin using npm, you simply use node.js' require() function and pass it in the name of the module. You then assign it to a new variable so that you can use it throughout your gulpfile. This is node.js' way of handling modularity, and because a gulpfile is essentially a small node.js application, it adopts this practice as well. Writing a task All tasks in gulp share a common structure. Having reviewed the five methods at the beginning of this section, you will already be familiar with most of it. Some tasks might end up being larger than others, but they still follow the same pattern. To better illustrate how they work, let's examine a bare skeleton of a task. This skeleton is the basic bone structure of each task we will be creating. Studying this structure will make it incredibly simple to understand how parts of gulp work together to create a task. 
A sample task looks as follows:

gulp.task(name, function() {
    return gulp.src(path)
        .pipe(plugin)
        .pipe(plugin)
        .pipe(gulp.dest(path));
});

In the first line, we use the new gulp variable that we created a moment ago and access the .task() method. This creates a new task in our gulpfile. As you learned earlier, the task method accepts two arguments: a task name as a string and a callback function that will contain the actions we wish to run when this task is executed. Inside the callback function, we reference the gulp variable once more and then use the .src() method to provide the input to our task. As you learned earlier, the source method accepts a path or an array of paths to the files that we wish to process. Next, we have a series of three .pipe() methods. In each of these pipe methods, we specify which plugin we would like to use. This grouping of pipes is what we call our pipechain. The data that we have provided gulp with in our source method will flow through our pipechain, to be modified by each piped plugin that it passes through. The order of the pipe methods is entirely up to you. This gives you a great deal of control over how and when your data is modified. You may have noticed that the final pipe is a bit different. At the end of our pipechain, we have to tell gulp to move our modified file somewhere. This is where the .dest() method comes into play. As we mentioned earlier, the destination method accepts a path that sets the destination of the processed file as it reaches the end of our pipechain. If .src() is our input, then .dest() is our output.

Reflection

To wrap up, take a moment to look at a finished gulpfile and reflect on the information that we just covered. This is the completed gulpfile that we will be creating from scratch, so don't worry if you still feel lost. This is just an opportunity to recognize the patterns and syntaxes that we have been studying so far. We will begin creating this file step by step. Notice that, because this gulpfile targets gulp 4.x, the watch task wraps each task name in gulp.series(), as described earlier:

// Load Node Modules/Plugins
var gulp = require('gulp');
var concat = require('gulp-concat');
var uglify = require('gulp-uglify');

// Process Styles
gulp.task('styles', function() {
    return gulp.src('css/*.css')
        .pipe(concat('all.css'))
        .pipe(gulp.dest('dist/'));
});

// Process Scripts
gulp.task('scripts', function() {
    return gulp.src('js/*.js')
        .pipe(concat('all.js'))
        .pipe(uglify())
        .pipe(gulp.dest('dist/'));
});

// Watch Files For Changes
gulp.task('watch', function() {
    gulp.watch('css/*.css', gulp.series('styles'));
    gulp.watch('js/*.js', gulp.series('scripts'));
});

// Default Task
gulp.task('default', gulp.parallel('styles', 'scripts', 'watch'));

The gulpfile.js file will look as follows:

Summary

In this article, you installed node.js and learned the basics of how to use npm, and understood how and why to install gulp both locally and globally. We also covered some of the core differences between the 3.x and 4.x versions of gulp and how they will affect your gulpfiles as we move forward. To wrap up the article, we took a small glimpse into the anatomy of a gulpfile to prepare us for writing our own gulpfiles from scratch.

Resources for Article:

Further resources on this subject:

Performing Task with Gulp [article]
Making a Web Server in Node.js [article]
Developing Node.js Web Applications [article]

Overview of Important Concepts of Microsoft Dynamics NAV 2016

Packt
19 Jun 2017
15 min read
In this article by Rabindra Sah, author of the book Mastering Microsoft Dynamics NAV 2016, we will cover the important concepts of Microsoft Dynamics NAV 2016. (For more resources related to this topic, see here.)

Version control

Version control systems are third-party systems that track changes to the files and folders of a system. In this section, we will be discussing two popular version control systems for Microsoft Dynamics NAV. Let's take a look at the web services architecture in Dynamics NAV. Microsoft Dynamics NAV 2016 uses two types of web services, SOAP web services and OData web services:

Object type    OData    SOAP
Page           Yes      Yes
Query          Yes      No
Codeunit       No       Yes

The main difference between the two in Microsoft Dynamics NAV is that with SOAP web services, you can publish and reuse business logic, and with OData web services, you can exchange data with external applications. In a dataset, you can make certain changes in order to improve the performance of the report. This is not applicable to all reports; it depends on the nature of the problem, and you should spend time analyzing the problem before fixing it. The following are some of the basic considerations that you should keep in mind when dealing with datasets in NAV reports:

Try to avoid the creation of dataset lines if possible; create variables instead
Try to reduce the number of rows and columns
Apply filters to the request page
For slow reports with a long runtime, use the job queue to generate the report on the server
Use text constants for the caption, if needed
Avoid Include Caption for columns that have no need for captions

Technical upgrade

A technical upgrade is the least used upgrade process, applied when you are making one version upgrade at a time, that is, from Version 2009 to Version 2013 or from Version 2013 to Version 2015. So, when you are planning to jump multiple versions at the same time, a technical upgrade might not be the best option to choose. It can be efficient when there are minimal changes in the source database objects, that is, fewer customizations. It can also be considered an efficient choice when the business requirement from the product is still the same or has only minor changes.

Upgrading estimates

In this section, we are going to look at the core components that are responsible for the estimates of the upgrade process. The components to be considered while estimating the upgrade process are as follows:

Code upgrade
Object transformation
Data upgrade
Testing and implementation

Code upgrade

The best method to estimate the code upgrade is to use a file compare tool. It helps us with file and folder comparison, version control, conflicts and resolution, automatic intelligent merging, in-place editing of files, tracking changes, and code analysis. You can also design your own compare tool if you want: take two versions of the same object, for example, two versions of the Customer table, open them in Notepad, check line by line whether there is any difference, and then log the lines that differ. You can achieve this via C# or any programming language. This should run for each object in the two versions of the NAV system and provide you with statistics on the amount of change. This can be really handy when it comes to estimating the code changes. You can also do it manually if the number of objects is small.
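As a rough illustration of that idea, here is a minimal Node.js sketch that compares two exported versions of the same object text file line by line and logs the differences. The file names are placeholders, and a real tool would loop over every exported object and use a proper diff algorithm rather than a purely positional comparison:

// compare-objects.js - log lines that differ between two exports of the same object
var fs = require('fs');

var oldLines = fs.readFileSync('Customer_2013.txt', 'utf8').split('\n');
var newLines = fs.readFileSync('Customer_2016.txt', 'utf8').split('\n');

var changed = 0;
var total = Math.max(oldLines.length, newLines.length);

for (var i = 0; i < total; i++) {
  if (oldLines[i] !== newLines[i]) {
    changed++;
    console.log('Line ' + (i + 1) + ' differs');
  }
}

console.log(changed + ' of ' + total + ' lines differ');

Running something like this over every object gives a quick, if approximate, count of how much code has changed between versions, which feeds directly into the estimate.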
It is recommended to use the Microsoft Dynamics Sure Step methodology while carrying out any upgrade project. Dynamics Sure Step is a full life cycle methodology designed to provide discipline and best practices for upgrading, migrating, configuring, and deploying a Microsoft Dynamics NAV solution.

Object transformation

We must take a close look at objects that are not directly upgraded. For example, if your source database reports are in the classic version or an early RTC version, it might not be feasible to transform them into the latest reports because of the huge technological gap between them. In these cases, you must be very careful while estimating these upgrades. For example, TransHeader and TransFooter and other groupings that are present in classic reports are hard to map directly onto Dynamics NAV 2016 reports. We might have to develop our own logic to achieve these grouping values, which might take some additional time. So, always treat this section as a customization instead of an upgrade. Mostly, Microsoft partner vendors keep this section separate and, in most cases, separate resources are assigned to it in parallel work environments. Reports can also have Word layouts, which should also be considered during the estimates.

Data upgrade

We perform a number of distinct steps while upgrading data. You must consider the time for each section in the data upgrade process in order to correctly estimate the time for it. The first thing that we do is a trial data upgrade. This allows us to analyze different aspects: how long it takes, whether the data upgrade process works or not, and the results of the trial upgrade itself, which we can test. We might need to repeat the trial a number of times before it works. Then, we can do a preproduction data upgrade, because since the moment we started our analysis and development, the production data might have changed; a preproduction run also gives us a closer estimate of the time window that will be available when we do the real implementation. Acceptance testing is also a very important phase. Once you have done the data upgrade, you need the end users or key users to confirm that the data has been converted correctly. Then you are ready to perform the live data upgrade. All of these phases in the data upgrade require time, and the amount of time will also depend on the size of the database and the version that you are starting from. This gives you an overview of the different pillars that are important for estimating how much time it might take to prepare and analyze the upgrade project.

Software as a Service

Software as a Service is a cloud services delivery model, which offers an on-demand online software subscription. The latest SaaS release from Microsoft is Dynamics 365 (previously known as Project Madeira). The following image illustrates the SaaS taxonomy.
Here you can clearly see different services, such as Salesforce, NetSuite, and QuickBooks, which are distributed as SaaS:

Software as a Service taxonomy

Understanding the PowerShell cmdlets

We can categorize the PowerShell commands into five major categories of use:

Commands for server administrators
Commands for implementers for company management
Commands for administrators for upgrades
Commands for administrators for security
Commands for developers

Commands for server administrators

The first category contains commands that can be used for administrative operations such as create, save, remove, get, import, export, and set, as given in the following list:

Dismount-NAVTenant
Export-NAVApplication
Export-NAVServerLicenseInformation
Get-NAVApplication
Get-NAVServerConfiguration
Get-NAVServerInstance
Get-NAVServerSession
Get-NAVTenant
Get-NAVWebServerInstance
Import-NAVServerLicense
Mount-NAVApplication
Mount-NAVTenant
New-NAVServerConfiguration
New-NAVServerInstance
New-NAVWebServerInstance
Remove-NAVApplication
Remove-NAVServerInstance
Remove-NAVServerSession
Save-NAVTenantConfiguration
Set-NAVServerConfiguration
Set-NAVServerInstance
Set-NAVWebServerInstanceConfiguration
Sync-NAVTenant

We can set up web server instances, change configurations, and create a multitenant environment; a multitenant environment can only be created using PowerShell.

Commands for implementers for company management

The second category of commands can be used by implementers, in particular for operations related to installation and configuration of the system. The following are a few examples of this category of commands:

Copy-NAVCompany
Get-NAVCompany
New-NAVCompany
Remove-NAVCompany
Rename-NAVCompany

Commands for administrators for upgrades

The third category is a special category for administrators, related to upgrade operations:

Get-NAVDataUpgrade
Resume-NAVDataUpgrade
Start-NAVDataUpgrade
Stop-NAVDataUpgrade

This category of commands can be useful along with the upgrade toolkit.

Commands for administrators for security

This is one of the most important categories, related to the backend of the system. The commands in this category grant accessibility and permissions to administrators. I strongly recommend these make-life-easy commands if you are working on security operations. Commands in this category include the following:

Get-NAVServerUser
Get-NAVServerUserPermissionSet
New-NAVServerPermission
New-NAVServerPermissionSet
New-NAVServerUser
New-NAVServerUserPermissionSet
Remove-NAVServerPermission
Remove-NAVServerPermissionSet
Remove-NAVServerUser
Remove-NAVServerUserPermissionSet
Set-NAVServerPermission
Set-NAVServerPermissionSet
Set-NAVServerUser

These commands are used mainly to add users and to manage permission sets.

Commands for developers

The last, but not the least, treasure of commands is dedicated to developers, and contains some of my most-used commands. It covers a wide range of commands and should be part of your daily work life.
This set of commands includes the following:

Compare-NAVApplicationObject
Export-NAVApplicationObjectLanguage
Get-NAVApplicationObjectProperty
Get-NAVWebService
Import-NAVApplicationObjectLanguage
Invoke-NAVCodeunit
Join-NAVApplicationObjectFile
Join-NAVApplicationObjectLanguageFile
Merge-NAVApplicationObject
New-NAVWebService
Remove-NAVApplicationObjectLanguage
Remove-NAVWebService
Set-NAVApplicationObjectProperty
Split-NAVApplicationObjectFile
Split-NAVApplicationObjectLanguageFile
Test-NAVApplicationObjectLanguage
Update-NAVApplicationObject

Microsoft Dynamics NAV 2016 Posting preview

In Microsoft Dynamics NAV 2016, you can review the entries to be created before you post a document or journal. This is made possible by the introduction of a new feature called Preview Posting, which enables you to preview the impact of a posting against each ledger associated with a document. In every document and journal that can be posted, you can click on Preview Posting to review the different types of entries that will be created when you post the document or journal.

Workflow

Workflow enables you to model modern business processes in line with best practices or industry-standard practice, for example, ensuring that a customer credit limit has been independently verified and that the requirement of two approvers for a payment process has been met. Workflow has these three main capabilities:

Approvals
Notifications
Automation

Workflow basically has three components, that is, Event, Condition, and Response. Event defines the event in the system that triggers the workflow step, On Condition specifies the condition that the event must satisfy, and the Then Response is the action that is taken when that condition is met. This is shown in the following screenshot:

Exception handling

Exception handling is a new concept in Microsoft Dynamics NAV. It was imported from .NET, and is now gaining popularity among C/AL programmers because of its effectiveness. Like C#, for exception handling, we use the Try function. The Try functions are new additions to the function library that enable you to handle errors occurring in the application at runtime. Here we are not dealing with compile-time issues. For example, the message Error Returned: Divisible by Zero Error is always a critical error, and should be handled so that it can be avoided. This also stops the system from entering an unsafe state. Like C# and other rich programming languages, the Try functions in C/AL provide easy-to-understand error messages, which can also be dynamic and directly generated by the system. This feature helps us preplan those errors and present better, more descriptive errors to the users. You can use the Try functions to catch errors/exceptions that are thrown by Microsoft Dynamics NAV or exceptions that are thrown during .NET Framework interoperability operations. The Try function is in many ways similar to the conditional Codeunit.Run function, except for the following points:

The database records that are changed because of the Try function cannot be rolled back
The Try function calls do not need to be committed to the database

Visual Basic programming

Visual Basic (VB) is an event-driven programming language. It is also an Integrated Development Environment (IDE). If you are familiar with the BASIC programming language, then it will be easy to understand Visual Basic, since it is derived from BASIC.
I will try to provide the basics about this language here, since it is the least discussed topic in the NAV community, but it is essential for all report designers and developers to understand. We do not need to understand each and every detail of the VB programming language, but understanding the syntax and structure will help us understand the code that we are going to use in the RDLC report. An example of VB code can be written as follows:

Public Function BlankZero(ByVal Value As Decimal)
    If Value = 0 Then
        Return ""
    End If
    Return Value
End Function

The preceding function, BlankZero, returns an empty string when the parameter is zero and otherwise returns the value unchanged. It is one of the simplest functions that can be found in the code section of an RDLC report. Unlike C/AL, we do not need to end each code line with a semicolon (;).

Writing your own Test unit

Writing your own Test unit is very important, not just to test your code but also to give you an eagle's-eye view of how your code is actually interacting with the system. It gives your coding a meaning, and allows others to understand and relate to your development. Writing a unit test basically involves four steps, as shown in the following diagram:

Certificates

A certificate is nothing but a token that binds an identity to a cryptographic key. Microsoft Management Console (MMC) is a presentation service for management applications in the Microsoft Windows environment. It is a part of the independent software vendor (ISV) extensible service, that is, it provides a common integrated environment for snap-ins provided by Microsoft and third-party software vendors.

Certificate authority

A certification authority (CA) is an entity that issues certificates. If all certificates have a common issuer, then the issuer's public key can be distributed out of band:

In the preceding diagram, the certificate server is the third party, which has a secure relationship with both the parties that want to communicate. The CA is connected to both parties through a secure channel. User B sends a copy of his public key to the CA. Then the CA encrypts the public key of User B using a different key. Two files are created as a result: the first is an encrypted package, which is nothing but the certificate, and the second is the digital signature of the certificate server. The certificate server returns the certificate to User B. Now User A asks User B for its certificate. User B sends a copy of its certificate, containing its public key, to User A. This is again done using a secure communication channel. User A decrypts the certificate using the key obtained from the certificate server, and extracts the public key of User B. User A also checks the digital signature of the certificate server to ensure that the certificate is authentic. Here, whatever data is encrypted using the public key of User B can only be decrypted using the private key of User B, which is only present with User B and not with any intruder on the Internet. So only User B can decrypt and read the content sent by User A. Once the keys are transferred, User A can communicate with User B. In case User B wants to send data to User A, User B would need the public key of User A, which will again be granted by the CA. Certificates are issued to a principal. The issuance policy specifies the principals to which the CA will issue certificates. The certification authority does not need to be online to check the validity of the certificate. It can sit on a server in a locked room. It is only consulted when a principal needs a certificate.
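The signature check at the heart of that exchange can be illustrated with a small Node.js sketch, using Node's built-in crypto module on a reasonably recent Node.js version. This is only a toy example, not anything NAV-specific: the "CA" signs some certificate data with its private key, and anyone holding the CA's public key can verify that signature.

// toy-ca-signature.js - illustrate signing and verifying with a key pair
var crypto = require('crypto');

// Stand-in for the CA's key pair
var keys = crypto.generateKeyPairSync('rsa', { modulusLength: 2048 });

// Stand-in for the certificate contents (an identity bound to a public key)
var certificateData = Buffer.from('CN=User B; public key: ...');

// The CA signs the certificate data with its private key
var signature = crypto.sign('sha256', certificateData, keys.privateKey);

// User A verifies the signature with the CA's public key
var authentic = crypto.verify('sha256', certificateData, keys.publicKey, signature);
console.log('Certificate signature valid:', authentic);

If any byte of the certificate data changes after signing, the verification returns false, which is exactly how User A detects a tampered certificate.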
Certificates are a way of associating an identity with a public key and distinguished name. Authentication policy for CA The authentication policy defines the way principals prove their identities. Each CA has its own requirements constrained by contractual requirements such as with Primary Certification Authority (PCA): PCA issues certificates to CA CA issues certificates to individuals and organizations All rely on non-electronic proofs of identity, such as biometrics (fingerprints), documents (drivers license or passport), or personal knowledge. A specific authentication policy can be determined by checking the policy of the CA that signed the certificate. Kinds of certificates There are at least four kinds of certificates, which are as follows: Site certificates (for example, www.msdn.microsoft.com). Personal certificates (used if the server wants to authenticate the client). You can install a personal certificate in your browser. Software vendor certificates (used when software is installed). Often, when you run a program, a dialog box appears warning that The publisher could not be verified. Are you sure you want to run this software? This is caused either because the software does not have a software vendor certificate, or because you do not trust the CA who signed the software vendor certificate. Anonymous certificates, (used by a whistle blower to indicate that the same person sent a sequence of messages, but doesn't know who that person is). Other types of certificates Certificates can also be based on a principal's association with an organization (such as Microsoft (MSDN)), where the principal lives, or the role played in an organization (such as the comptroller). Summary In this article we covered the important concepts in Dynamics Nav 2016 such as version control, dataset considerations, technical upgrade, Software as a Service, certificates, and so on. Resources for Article: Further resources on this subject: Introduction to Microsoft Dynamics NAV 2016 [article] Customization in Microsoft Dynamics CRM [article] Exploring Microsoft Dynamics NAV – An Introduction [article]

Introduction to Cyber Extortion

Packt
19 Jun 2017
21 min read
In this article, Dhanya Thakkar, the author of the book Preventing Digital Extortion, explains how we often make the mistake of relying on the past for predicting the future, and nowhere is this more relevant than in the sphere of the Internet and smart technology. People, processes, data, and things are tightly and increasingly connected, creating new, intelligent networks unlike anything else we have seen before. The growth is exponential and the consequences are far reaching for individuals, and progressively so for businesses. We are creating the Internet of Things and the Internet of Everything. (For more resources related to this topic, see here.)

It has become unimaginable to run a business without using the Internet. It is not only an essential tool for current products and services, but an unfathomable well for innovation and fresh commercial breakthroughs. The transformative revolution is spilling into the public sector, affecting companies as vanguards and diffusing to consumers, who are in a feedback loop with suppliers, constantly obtaining and demanding new goods. Advanced technologies that apply not only to machine-to-machine communication but also to smart sensors generate complex networks to which theoretically anything that can carry a sensor can be connected. Cloud computing and cloud-based applications provide immense yet affordable storage capacity for people and organizations and facilitate the spread of data in more ways than one. Given the Internet's nature, the physical boundaries of business become blurred, and virtual data protection must incorporate a new characteristic of security: encryption. In the middle of the storm of the IoT, major opportunities arise, and equally so, unprecedented risks lurk. People often think that what they put on the Internet is protected and closed information. It is hardly so. Sending an e-mail is not like sending a letter in a closed envelope. It is more like sending a postcard, where anyone who gets their hands on it can read what's written on it. Along with people who want to utilize the Internet as an open business platform, there are people who want to find ways of circumventing legal practices and misusing the wealth of data on computer networks by unlawfully gaining financial profits, assets, or authority that can be monetized. Being connected is now critical. As cyberspace is growing, so are attempts to violate vulnerable information, which are gaining global scale. This newly discovered business dynamic is under persistent threat from criminals. Cyberspace, cybercrime, and cybersecurity are perceptibly being found in the same sentence.

Cybercrime – underdefined and underregulated

A massive problem encouraging the perseverance and evolution of cybercrime is the lack of an adequate unanimous definition and the underregulation on a national, regional, and global level. Nothing is criminal unless stipulated by the law. Global law enforcement agencies, academia, and state policies have studied the constant development of the phenomenon since its first appearance in 1989, in the shape of the AIDS Trojan virus transferred from an infected floppy disk. Regardless of the bizarre beginnings, there is nothing entertaining about cybercrime. It is serious. It is dangerous. Significant efforts are made to define cybercrime on a conceptual level in academic research and in national and regional cybersecurity strategies. Still, as the nature of the phenomenon evolves, so must the definition.
Research reports are still at a descriptive level, and underreporting is a major issue. On the other hand, businesses are more exposed due to ignorance of the fact that modern-day criminals increasingly rely on the Internet to enhance their criminal operations. Case in point: Aaushi Shah and Srinidhi Ravi from the Asian School of Cyber Laws have created a cybercrime list by compiling a set of 74 distinctive and creatively named actions emerging in the last three decades that can be interpreted as cybercrime. These actions target anything from e-mails to smartphones, personal computers, and business intranets: piggybacking, joe jobs, and easter eggs may sound like cartoons, but their true nature resembles a crime thriller.

The concept of cybercrime

Cyberspace is a giant community made out of connected computer users and data on a global level. As a concept, cybercrime involves any criminal act dealing with computers and networks, including traditional crimes in which the illegal activities are committed through the use of a computer and the Internet. As businesses become more open and widespread, the boundary between data freedom and restriction becomes more porous. Countless e-shopping transactions are made, hospitals keep records of patient histories, students pass exams, and around-the-clock payments are increasingly processed online. It is no wonder that criminals are relentlessly invading cyberspace, trying to find a crack to slip through. There are no recognizable border controls on the Internet, but a business that wants to evade harm needs to understand cybercrime's nature and apply means to restrict access to certain information. Instead of identifying it as a single phenomenon, Majid Jar proposes a common denominator approach for all ICT-related criminal activities. In his book Cybercrime and Society, Jar refers to Thomas and Loader's working concept of cybercrime as follows:

"Computer-mediated activities which are either illegal or considered illicit by certain parties and which can be conducted through global electronic network."

Jar elaborates on the important distinction in this definition by emphasizing the difference between crime and deviance. Criminal activities are explicitly prohibited by formal regulations and bear sanctions, while deviances breach informal social norms. This is a key point to keep in mind. It encompasses the evolving definition of cybercrime, which keeps transforming as resourceful criminals constantly think of new ways to gain illegal advantages. Law enforcement agencies on a global level make an essential distinction between two subcategories of cybercrime:

Advanced cybercrime or high-tech crime
Cyber-enabled crime

The first subcategory, according to Interpol, includes newly emerged sophisticated attacks against computer hardware and software. The second category, on the other hand, contains traditional crimes in modern clothes, for example, crimes against children, such as exposing children to illegal content; financial crimes, such as payment card fraud, money laundering, and counterfeiting currency and security documents; social engineering fraud; and even terrorism. We are far beyond the limited impact of the 1989 cybercrime embryo. Intricate networks are created daily. They present new criminal opportunities, causing greater damage to businesses and individuals, and require a global response.
Cybercrime is conceptualized as a service embracing a commercial component. Cybercriminals work as businessmen who look to sell a product or a service to the highest bidder.

Critical attributes of cybercrime

An abridged version of the cybercrime concept provides answers to three vital questions:

Where are criminal activities committed and what technologies are used?
What is the reason behind the violation?
Who is the perpetrator of the activities?

Where and how – realm

Cybercrime can be an online, digitally committed, traditional offense. Even if the component of an online, digital, or virtual existence were not included in its nature, it would still have been considered crime in the traditional, real-world sense of the word. In this sense, as the nature of cybercrime advances, so must the spearheads of law enforcement rely on laws written for the non-digital world to solve problems encountered online. Otherwise, the combat becomes stagnant and futile.

Why – motivation

The prefix "cyber" sometimes creates additional misperception when applied to the digital world. It is critical to differentiate cybercrime from other malevolent acts in the digital world by considering the reasoning behind the action. This is not only imperative for clarification purposes, but also for extending the definition of cybercrime over time to include previously indeterminate activities. Offenders commit a wide range of dishonest acts for selfish motives such as monetary gain, popularity, or gratification. When the intent behind the behavior is misinterpreted, confusion may arise, and actions that should not have been classified as cybercrime could be charged with criminal prosecution.

Who – the criminal deed component

The action must be attributed to a perpetrator. Depending on the source, certain threats can be confined to the criminal domain only or expanded to endanger potentially larger targets, representing an attack on national security or a terrorist attack. Undoubtedly, the concept of cybercrime needs additional refinement, and a comprehensive global definition is in progress. Along with global cybercrime initiatives, national regulators are continually working on implementing laws, policies, and strategies to exemplify cybercrime behaviors and thus strengthen combating efforts.

Types of common cyber threats

In their endeavors to raise cybercrime awareness, the United Kingdom's National Crime Agency (NCA) divided common and popular cybercrime activities by affiliating them with the target under threat. While both individuals and organizations are targets of cybercriminals, it is the business-consumer networks that suffer irreparable damages due to the magnitude of harmful actions.

Cybercrime targeting consumers

Phishing: Illegitimate e-mails are sent to the receiver to collect security information and personal details
Webcam manager: A gross violation in which criminals take over a person's webcam
File hijacker: Criminals hijack files and hold them "hostage" until the victim pays the demanded ransom
Keylogging: Criminals gain the means to record the text behind the keys you press on your keyboard
Screenshot manager: Enables criminals to take screenshots of an individual's computer screen
Ad clicker: Annoying but dangerous ad clickers direct a victim's computer to click on a specific harmful link

Cybercrime targeting businesses

Hacking

Hacking is basically unauthorized access to computer data.
Hackers inject specialist software with which they try to take administrative control of a computerized network or system. If the attack is successful, the stolen data can be sold on the dark web and compromise people's integrity and safety by intruding on and abusing the privacy of products as well as sensitive personal and business information. Hacking is particularly dangerous when it compromises the operation of systems that manage physical infrastructure, for example, public transportation.

Distributed denial of service (DDoS) attacks

When an online service is targeted by a DDoS attack, the communication links overflow with data from messages sent simultaneously by botnets. Botnets are networks of controlled computers that stop legitimate access to online services for users. The system is unable to provide normal access as it cannot handle the huge volume of incoming traffic.

Cybercrime in relation to overall computer crime

Many moons have passed since 2001, when the first international treaty that targeted Internet and computer crime—the Budapest Convention on Cybercrime—was adopted. The Convention's intention was to harmonize national laws, improve investigative techniques, and increase cooperation among nations. It was drafted with the active participation of the Council of Europe's observer states Canada, Japan, South Africa, and the United States, and drawn up by the Council of Europe in Strasbourg, France. Brazil and Russia, on the other hand, refused to sign the document on the basis of not being involved in the Convention's preparation. In Understanding Cybercrime: A Guide to Developing Countries (Gercke, 2011), Marco Gercke makes an excellent final point:

"Not all computer-related crimes come under the scope of cybercrime. Cybercrime is a narrower notion than all computer-related crime because it has to include a computer network. On the other hand, computer-related crime in general can also affect stand-alone computer systems."

Although progress has been made, consensus over the definition of cybercrime is not final. Given this history, a fluid and evolving approach must be applied to working and legal interpretations. In the end, international noncompliance must be overcome to establish a common and safe ground to tackle persistent threats.

Cybercrime localized – what is the risk in your region?

Europol's heat map for the period between 2014 and 2015 reports on the geographical distribution of cybercrime on the basis of the United Nations geoscheme. The data in the report encompassed cyber-dependent crime and cyber-enabled fraud, but it did not include investigations into online child sexual abuse.

North and South America

Due to its overwhelming presence, it is not a great surprise that the North American region occupies several lead positions concerning cybercrime, both in terms of enabling malicious content and in providing residency to victims who contribute to the global cybercrime numbers. The United States hosted between 20% and nearly 40% of the total world's command-and-control servers during 2014. Additionally, the US currently hosts over 45% of the world's phishing domains and is in the pack of world-leading spam producers. Between 16% and 20% of all global bots are located in the United States, while almost a third of point-of-sale malware and over 40% of all ransomware incidents were detected there.
Twenty EU member states have initiated criminal procedures in which the parties under suspicion were located in the United States. In addition, over 70 percent of the countries located in the Single European Payment Area have been subject to losses from skimmed payment cards because of the distinct way in which the US, under certain circumstances, processes card payments without chip-and-PIN technology. There are instances of cybercrime in South America, but the scope of participation by the southern continent is way smaller than that of its northern neighbor, both in industry reporting and in criminal investigations. Ecuador, Guatemala, Bolivia, Peru, and Brazil are constantly rated high on the malware infection scale, and the situation is not changing, while Argentina and Colombia remain among the top 10 spammer countries. Brazil has a critical role in point-of-sale malware, ATM malware, and skimming devices. Europe The key aspect making Europe a region with excellent cybercrime potential is the fast, modern, and reliable ICT infrastructure. According to The Internet Organised Crime Threat Assessment (IOCTA) 2015, Cybercriminals abuse Western European countries to host malicious content and launch attacks inside and outside the continent. EU countries host approximately 13 percent of the global malicious URLs, out of which Netherlands is the leading country, while Germany, the U.K., and Portugal come second, third, and fourth respectively. Germany, the U.K., the Netherlands, France, and Russia are important hosts for bot C&C infrastructure and phishing domains, while Italy, Germany, the Netherlands, Russia, and Spain are among the top sources of global spam. Scandinavian countries and Finland are famous for having the lowest malware infection rates. France, Germany, Italy, and to some extent the U.K. have the highest malware infection rates and the highest proportion of bots found within the EU. However, the findings are presumably the result of the high population of the aforementioned EU countries. A half of the EU member states identified criminal infrastructure or suspects in the Netherlands, Germany, Russia, or the United Kingdom. One third of the European law enforcement agencies confirmed connections to Austria, Belgium, Bulgaria, the Czech Republic, France, Hungary, Italy, Latvia, Poland, Romania, Spain, or Ukraine. Asia China is the United States' counterpart in Asia in terms of the top position concerning reported threats to Internet security. Fifty percent of the EU member states' investigations on cybercrime include offenders based in China. Moreover, certain authorities quote China as the source of one third of all global network attacks. In the company of India and South Korea, China is third among the top-10 countries hosting botnet C&C infrastructure, and it has one of the highest global malware infection rates. India, Indonesia, Malaysia, Taiwan, and Japan host serious bot numbers, too. Japan takes on a significant part both as a source country and as a victim of cybercrime. Apart from being an abundant spam source, Japan is included in the top three Asian countries where EU law enforcement agencies have identified cybercriminals. On the other hand, Japan, along with South Korea and the Philippines, is the most popular country in the East and Southeast region of Asia where organized crime groups run sextortion campaigns. Vietnam, India, and China are the top Asian countries featuring spamming sources. 
Alternatively, China and Hong Kong are the most prominent locations for hosting phishing domains. From another point of view, the country code top-level domains (ccTLDs) for Thailand and Pakistan are commonly used in phishing attacks. In this region, most SEPA members reported losses from the use of skimmed cards. In fact, five (Indonesia, Philippines, South Korea, Vietnam, and Malaysia) out of the top six countries are from this region. Africa Africa remains renowned for combined and sophisticated cybercrime practices. Data from the Europol heat map report indicates that the African region holds a ransomware-as-a-service presence equivalent to the one of the European black market. Cybercriminals from Africa make profits from the same products. Nigeria is on the list of the top 10 countries compiled by the EU law enforcement agents featuring identified cybercrime perpetrators and related infrastructure. In addition, four out of the top five top-level domains used for phishing are of African origin: .cf, .za, .ga, and .ml. Australia and Oceania Australia has two critical cybercrime claims on a global level: First, the country is present in several top-10 charts in the cybersecurity industry, including bot populations, ransomware detection, and network attack originators. Second, the country-code top-level domain for the Palau Islands in Micronesia is massively used by Chinese attackers as the TLD with the second highest proportion of domains used for phishing. Cybercrime in numbers Experts agree that the past couple of years have seen digital extortion flourishing. In 2015 and 2016, cybercrime reached epic proportions. Although there is agreement about the serious rise of the threat, putting each ransomware aspect into numbers is a complex issue. Underreporting is not an issue only in academic research but also in practical case scenarios. The threat to businesses around the world is growing, because businesses keep it quiet. The scope of extortion is obscured because companies avoid reporting and pay the ransom in order to settle the issue in a conducive way. As far as this goes for corporations, it is even more relevant for public enterprises or organizations that provide a public service of any kind. Government bodies, hospitals, transportation companies, and educational institutions are increasingly targeted with digital extortion. Cybercriminals estimate that these targets are likely to pay in order to protect drops in reputation and to enable uninterrupted execution of public services. When CEOs and CIOs keep their mouths shut, relying on reported cybercrime numbers can be a tricky question. The real picture is not only what is visible in the media or via professional networking, but also what remains hidden and is dealt with discreetly by the security experts. In the second quarter of 2015, Intel Security reported an increase in ransomware attacks by 58%. Just in the first 3 months of 2016, cybercriminals amassed $209 million from digital extortion. By making businesses and authorities pay the relatively small average ransom amount of $10,000 per incident, extortionists turn out to make smart business moves. Companies are not shaken to the core by this amount. Furthermore, they choose to pay and get back to business as usual, thus eliminating further financial damages that may arise due to being out of business and losing customers. Extortionists understand the nature of ransom payment and what it means for businesses and institutions. As sound entrepreneurs, they know their market. 
Instead of setting unreasonable skyrocketing prices that may cause major panic and draw severe law enforcement action, they keep it low profile. In this way, they maintain the dark business in flow, moving from one victim to the next and evading legal measures. A peculiar perspective – Cybercrime in absolute and normalized numbers “To get an accurate picture of the security of cyberspace, cybercrime statistics need to be expressed as a proportion of the growing size of the Internet similar to the routine practice of expressing crime as a proportion of a population, i.e., 15 murders per 1,000 people per year.” This statement by Eric Jardine from the Global Commission on Internet Governance (Jardine, 2015) launched a new perspective of cybercrime statistics, one that accounts for the changing nature and size of cyberspace. The approach assumes that viewing cybercrime findings isolated from the rest of the changes in cyberspace provides a distorted view of reality. The report aimed at normalizing crime statistics and thus avoiding negative, realistic cybercrime scenarios that emerge when drawing conclusions from the limited reliability of absolute numbers. In general, there are three ways in which absolute numbers can be misinterpreted: Absolute numbers can negatively distort the real picture, while normalized numbers show whether the situation is getting better Both numbers can show that things are getting better, but normalized numbers will show that the situation is improving more quickly Both numbers can indicate that things are deteriorating, but normalized numbers will indicate that the situation is deteriorating at a slower rate than absolute numbers Additionally, the GCIG (Global Commission on Internet Governance) report includes some excellent reasoning about the nature of empirical research undertaken in the age of the Internet. While almost everyone and anything is connected to the network and data can be easily collected, most of the information is fragmented across numerous private parties. Normally, this entangles the clarity of the findings of cybercrime presence in the digital world. When data is borrowed from multiple resources and missing slots are modified with hypothetical numbers, the end result can be skewed. Keeping in mind this observation, it is crucial to emphasize that the GCIG report measured the size of cyberspace by accounting for eight key aspects: The number of active mobile broadband subscriptions The number of smartphones sold to end users The number of domains and websites The volume of total data flow The volume of mobile data flow The annual number of Google searches The Internet’s contribution to GDP It has been illustrated several times during this introduction that as cyberspace grows, so does cybercrime. To fight the menace, businesses and individuals enhance security measures and put more money into their security budgets. A recent CIGI-Ipsos (Centre for International Governance Innovation - Ipsos) survey collected data from 23,376 Internet users in 24 countries, including Australia, Brazil, Canada, China, Egypt, France, Germany, Great Britain, Hong Kong, India, Indonesia, Italy, Japan, Kenya, Mexico, Nigeria, Pakistan, Poland, South Africa, South Korea, Sweden, Tunisia, Turkey, and the United States. Survey results showed that 64% of users were more concerned about their online privacy compared to the previous year, whereas 78% were concerned about having their banking credentials hacked. 
Additionally, 77% of users were worried about cyber criminals stealing private images and messages. These perceptions led to behavioral changes: 43% of users started avoiding certain sites and applications, some 39% regularly updated passwords, while about 10% used the Internet less (CIGI-Ipsos, 2014).

GCIC report results are indicative of a heterogeneous cyber security picture. Although many cybersecurity aspects are deteriorating over time, there are some that are staying constant, and a surprising number are actually improving. Jardine compares cyberspace security to trends in crime rates in a specific country, operationalizing cyber attacks via 13 measures presented in the following table, as seen in Table 2 of Summary Statistics for the Security of Cyberspace (E. Jardine, GCIC Report, p. 6):

Measure                          Minimum       Maximum          Mean           Standard Deviation
New Vulnerabilities              4,814         6,787            5,749          781.880
Malicious Web Domains            29,927        74,000           53,317         13,769.99
Zero-day Vulnerabilities         8             24               14.85714       6.336
New Browser Vulnerabilities      232           891              513            240.570
Mobile Vulnerabilities           115           416              217.35         120.85
Botnets                          1,900,000     9,437,536        4,485,843      2,724,254
Web-based Attacks                23,680,646    1,432,660,467    907,597,833    702,817,362
Average per Capita Cost          188           214              202.5          8.893818078
Organizational Cost              5,403,644     7,240,000        6,233,941      753,057
Detection and Escalation Costs   264,280       455,304          372,272        83,331
Response Costs                   1,294,702     1,738,761        1,511,804      152,502.2526
Lost Business Costs              3,010,000     4,592,214        3,827,732      782,084
Victim Notification Costs        497,758       565,020          565,020        30,342

While reading the table results, an essential argument must be kept in mind. Statistics for cybercrime costs are not available worldwide. The author worked with the assumption that data about US costs of cybercrime indicate costs on a global level. For obvious reasons, however, this assumption may not be true, and many countries will have had significantly lower costs than the US. To mitigate the assumption's flaws, the author provides comparative levels of those measures. The organizational cost of data breaches in 2013 in the United States was a little less than six million US dollars, while the average number on the global level, which was drawn from the Ponemon Institute's annual Cost of Data Breach Study (from 2011, 2013, and 2014, via Jardine, p. 7), measured the overall cost of data breaches, including the US ones, as US$2,282,095. The conclusion is that US numbers will distort global cost findings by expanding the real costs and will work against the paper's suggestion, which is that normalized numbers paint a rosier picture than the one provided by absolute numbers.

Summary

In this article, we have covered the birth and concept of cybercrime and the challenges law enforcement, academia, and security professionals face when combating its threatening behavior. We also explored the impact of cybercrime by numbers on varied geographical regions, industries, and devices.

Resources for Article:

Further resources on this subject:

Interactive Crime Map Using Flask [article]
Web Scraping with Python [article]