
How-To Tutorials - Data

1210 Articles

7 tips for using Git and GitHub the right way

Sugandha Lahoti
06 Oct 2018
3 min read
GitHub has become a widely accepted and integral part of software development owing to the change-tracking features it offers. Git itself was created in 2005 by Linus Torvalds to support the development of the Linux kernel; GitHub, launched in 2008, builds a collaboration platform on top of it. In this post, Alex Magana and Joseph Muli, the authors of the Introduction to Git and GitHub course, discuss some of the best practices you should keep in mind while learning or using Git and GitHub.

Document everything
A practice that eases work in any team is ample documentation. Documenting something as simple as a repository goes a long way in presenting your work and attracting contributors. It shapes the first impression when someone is evaluating a tool to aid their development.

Utilize the README.md and wikis
Use the README.md and wikis to explain the functionality the application delivers and to categorize guide topics and material. You should:
- Communicate the problem that the code in the repository solves.
- Specify the guidelines and rules of engagement that govern contributing to the codebase.
- Indicate the dependencies required to set up the working environment.
- Stipulate setup instructions to get a working version of the application in a contributor's local environment.

Keep naming conventions simple and concise
Naming conventions are highly encouraged for repositories and branches. They should be simple, descriptive, and concise. For instance, a repository that houses code for a Git course could simply be named "git-tutorial-material". For a learner on that course, it's much easier to find than a repository named just "material".

Adopt naming prefixes
Adopt naming prefixes for different task types when naming branches. For example, you may use feat- for feature branches, bug- for bug branches, and fix- for fix branches. Also, make use of templates that include a checklist for pull requests (a short command-line sketch of this workflow appears at the end of this post).

Correspond a PR and branch to a ticket or task
A PR and its branch should correspond to a ticket or task on the project management board. This helps align the effort spent on a product with the appropriate milestones and product vision.

Organize and track tasks using issues
Tasks such as technical debt, bugs, and workflow improvements should be organized and tracked using issues. You should also enforce push and pull restrictions on the default branch, and use webhooks to automate deployment and run pre-merge test suites.

Use atomic commits
Another best practice is to use atomic commits. An atomic commit applies a set of distinct changes as a single operation. Persist changes in small changesets, and use descriptive and concise commit messages to record what was changed.

This was a guest post from Alex Magana and Joseph Muli, the authors of Introduction to Git and GitHub. We hope that these best practices help you manage your Git and GitHub workflow more smoothly. Don't forget to check out Alex and Joseph's Introduction to Git and GitHub course to learn how to create and enforce checks and controls for the introduction, scrutiny, approval, merging, and reversal of changes.

GitHub introduces 'Experiments', a platform to share live demos of their research projects
Packt's GitHub portal hits 2,000 repositories
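To make the branch-prefix and atomic-commit advice above concrete, here is the minimal sketch referenced earlier. It is illustrative only: the branch name, file path, and commit message are invented, not taken from the course.

# Create a prefixed branch for a new feature
git checkout -b feat-user-login

# Stage and commit one logical change at a time, with a concise, descriptive message
git add auth/login.js
git commit -m "Add client-side validation to the login form"

# Push the branch and open a pull request that links back to its ticket
git push -u origin feat-user-login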


Handling backup and recovery in PostgreSQL 10 [Tutorial]

Savia Lobo
18 Jun 2018
11 min read
Performing backups should be a regular task, and every administrator is supposed to keep an eye on this vital job. Fortunately, PostgreSQL provides an easy means to create backups. In this tutorial, you will learn how to back up data by performing some simple dumps and how to recover it using PostgreSQL 10. This article is an excerpt from 'Mastering PostgreSQL 10', written by Hans-Jürgen Schönig. The book highlights the newly introduced features in PostgreSQL 10 and shows you how to build better PostgreSQL applications.

Performing simple dumps
If you are running a PostgreSQL setup, there are two major methods to perform backups:
- Logical dumps (extracting an SQL script representing your data)
- Transaction log shipping

The idea behind transaction log shipping is to archive binary changes made to the database. Most people claim that transaction log shipping is the only real way to do backups. However, in my opinion, this is not necessarily true. Many people rely on pg_dump to simply extract a textual representation of the data. pg_dump is also the oldest method of creating a backup and has been around since the very early days of the project (transaction log shipping was added much later). Every PostgreSQL administrator will become familiar with pg_dump sooner or later, so it is important to know how it really works and what it does.

Running pg_dump
The first thing we want to do is create a simple textual dump:

[hs@linuxpc ~]$ pg_dump test > /tmp/dump.sql

This is the most simplistic backup you can imagine. pg_dump logs into the local database instance, connects to the database test, and starts to extract all the data, which is sent to stdout and redirected to the file. The beauty is that standard output gives you all the flexibility of a Unix system: you can easily compress the data using a pipe, or do whatever else you want with it.

In some cases, you might want to run pg_dump as a different user. All PostgreSQL client programs support a consistent set of command-line parameters to pass user information. If you just want to set the user, use the -U flag:

[hs@linuxpc ~]$ pg_dump -U whatever_powerful_user test > /tmp/dump.sql

The following set of parameters can be found in all PostgreSQL client programs:

...
Connection options:
  -d, --dbname=DBNAME      database to dump
  -h, --host=HOSTNAME      database server host or socket directory
  -p, --port=PORT          database server port number
  -U, --username=NAME      connect as specified database user
  -w, --no-password        never prompt for password
  -W, --password           force password prompt (should happen automatically)
  --role=ROLENAME          do SET ROLE before dump
...

Just pass the information you want to pg_dump, and if you have enough permissions, PostgreSQL will fetch the data. The important thing here is to see how the program really works. Basically, pg_dump connects to the database and opens a large repeatable read transaction that simply reads all the data. Remember, repeatable read ensures that PostgreSQL creates a consistent snapshot of the data, which does not change throughout the transaction. In other words, a dump is always consistent; no foreign keys will be violated. The output is a snapshot of the data as it was when the dump started. Consistency is a key factor here. It also implies that changes made to the data while the dump is running won't make it into the backup anymore. A dump simply reads everything, so there are no separate permissions needed to dump something: as long as you can read it, you can back it up.
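The "compress using a pipe" idea mentioned above is worth spelling out. This is a minimal sketch; the file paths and the choice of gzip are illustrative rather than taken from the excerpt.

# Compress the dump while it is being written, instead of storing plain SQL
pg_dump test | gzip > /tmp/dump.sql.gz

# A plain-format dump is just SQL, so after decompression it can be replayed with psql
gunzip -c /tmp/dump.sql.gz | psql test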
Also, note that the backup is in a textual format by default. This means that you can safely extract data from, say, Solaris, and move it to some other CPU architecture. In the case of binary copies, that is clearly not possible, as the on-disk format depends on your CPU architecture.

Passing passwords and connection information
If you take a close look at the connection parameters shown in the previous section, you will notice that there is no way to pass a password to pg_dump. You can enforce a password prompt, but you cannot pass the parameter to pg_dump using a command-line option. The reason for that is simple: the password might show up in the process table and be visible to other people. Therefore, this is not supported. The question now is: if pg_hba.conf on the server enforces a password, how can the client program provide it? There are various means of doing that:
- Making use of environment variables
- Making use of .pgpass
- Using service files

In this section, you will learn about all three methods.

Using environment variables
One way to pass all kinds of parameters is to use environment variables. If the information is not explicitly passed to pg_dump, it will look for the missing information in predefined environment variables. A list of all potential settings can be found here: https://www.postgresql.org/docs/10/static/libpq-envars.html. The following overview shows some environment variables commonly needed for backups:
- PGHOST: tells the system which host to connect to
- PGPORT: defines the TCP port to be used
- PGUSER: tells a client program which user to connect as
- PGPASSWORD: contains the password to be used
- PGDATABASE: the name of the database to connect to

The advantage of these environment variables is that the password won't show up in the process table. However, there is more. Consider the following example:

psql -U ... -h ... -p ... -d ...

Suppose you are a system administrator: do you really want to type a long line like that a couple of times every day? If you are working with the very same host again and again, just set those environment variables and connect with plain psql:

[hs@linuxpc ~]$ export PGHOST=localhost
[hs@linuxpc ~]$ export PGUSER=hs
[hs@linuxpc ~]$ export PGPASSWORD=abc
[hs@linuxpc ~]$ export PGPORT=5432
[hs@linuxpc ~]$ export PGDATABASE=test
[hs@linuxpc ~]$ psql
psql (10.1)
Type "help" for help.

As you can see, there are no command-line parameters anymore. Just type psql and you are in. All applications based on the standard PostgreSQL C-language client library (libpq) understand these environment variables, so you can use them not only for psql and pg_dump, but for many other applications as well.

Making use of .pgpass
A very common way to store login information is in .pgpass files. The idea is simple: put a file called .pgpass into your home directory and keep your login information there. The format is simple:

hostname:port:database:username:password

An example would be:

192.168.0.45:5432:mydb:xy:abc

PostgreSQL offers some nice additional functionality: most fields can contain *. Here is an example:

*:*:*:xy:abc

This means that on every host, on every port, for every database, the user called xy will use abc as the password. To make PostgreSQL use the .pgpass file, make sure that the right file permissions are in place:

chmod 0600 ~/.pgpass

.pgpass can also be used on a Windows system. In that case, the file can be found in the %APPDATA%\postgresql\pgpass.conf path.
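Before moving on to service files, here is a minimal end-to-end sketch of the .pgpass approach described above. The host, port, database, and credentials reuse the illustrative values from the text; the output path is invented.

# Store the credentials and lock the file down so libpq will accept it
echo '192.168.0.45:5432:mydb:xy:abc' >> ~/.pgpass
chmod 0600 ~/.pgpass

# pg_dump now finds the password itself: no prompt, nothing visible in the process table
pg_dump -h 192.168.0.45 -p 5432 -U xy mydb > /tmp/mydb.sql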
Using service files
However, there is not just the .pgpass file. You can also make use of service files. Here is how it works: if you want to connect to the very same servers over and over again, you can create a .pg_service.conf file. It will hold all the connection information you need. Here is an example of a .pg_service.conf file:

Mac:~ hs$ cat .pg_service.conf
# a sample service
[hansservice]
host=localhost
port=5432
dbname=test
user=hs
password=abc

[paulservice]
host=192.168.0.45
port=5432
dbname=xyz
user=paul
password=cde

To connect to one of the services, just set the environment variable and connect:

iMac:~ hs$ export PGSERVICE=hansservice

A connection can now be established without passing parameters to psql:

iMac:~ hs$ psql
psql (10.1)
Type "help" for help.
test=#

Alternatively, you can use:

psql service=hansservice

Extracting subsets of data
Up to now, you have seen how to dump an entire database. However, this is often not what you want. In many cases, you might just want to extract a subset of tables or schemas. pg_dump can do that and provides a number of switches:
- -a: dumps only the data and skips the data structure
- -s: dumps only the data structure and skips the data
- -n: dumps only a certain schema
- -N: dumps everything but excludes certain schemas
- -t: dumps only certain tables
- -T: dumps everything but certain tables (this can make sense if you want to exclude logging tables and so on)

Partial dumps can be very useful for speeding things up considerably.
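As a hedged illustration of the switches just listed (the table and schema names are invented for the example), partial dumps might look like this:

# Dump only the structure (no rows) of a single table
pg_dump -s -t t_test test > /tmp/t_test_schema.sql

# Dump everything except one schema, data and structure included
pg_dump -N audit test > /tmp/no_audit.sql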
Handling various formats
So far, you have seen that pg_dump can be used to create text files. The problem is that a text file can only be replayed completely: if you have saved an entire database, you can only replay the entire thing. In many cases, this is not what you want. Therefore, PostgreSQL has additional formats that also offer more functionality. At this point, four formats are supported:

-F, --format=c|d|t|p    output file format (custom, directory, tar, plain text (default))

You have already seen plain, which is just normal text. On top of that, you can use a custom format. The idea behind the custom format is to have a compressed dump, including a table of contents. Here are two ways to create a custom format dump:

[hs@linuxpc ~]$ pg_dump -Fc test > /tmp/dump.fc
[hs@linuxpc ~]$ pg_dump -Fc test -f /tmp/dump.fc

In addition to the table of contents, the compressed dump has one more advantage: it is a lot smaller. The rule of thumb is that a custom format dump is around 90% smaller than the database instance you are about to back up. Of course, this highly depends on the number of indexes and so on, but for many database applications this rough estimation will hold true. Once you have created the backup, you can inspect the backup file:

[hs@linuxpc ~]$ pg_restore --list /tmp/dump.fc
;
; Archive created at 2017-11-04 15:44:56 CET
;     dbname: test
;     TOC Entries: 18
;     Compression: -1
;     Dump Version: 1.12-0
;     Format: CUSTOM
;     Integer: 4 bytes
;     Offset: 8 bytes
;     Dumped from database version: 10.1
;     Dumped by pg_dump version: 10.1
;
; Selected TOC Entries:
;
3103; 1262 16384 DATABASE - test hs
3; 2615 2200 SCHEMA - public hs
3104; 0 0 COMMENT - SCHEMA public hs
1; 3079 13350 EXTENSION - plpgsql
3105; 0 0 COMMENT - EXTENSION plpgsql
187; 1259 16391 TABLE public t_test hs
...

pg_restore --list returns the table of contents of the backup. Using a custom format is a good idea, as the backup will shrink in size. However, there is more: the -Fd option will create a backup in the directory format. Instead of a single file, you will now get a directory containing a couple of files:

[hs@linuxpc ~]$ mkdir /tmp/backup
[hs@linuxpc ~]$ pg_dump -Fd test -f /tmp/backup/
[hs@linuxpc ~]$ cd /tmp/backup/
[hs@linuxpc backup]$ ls -lh
total 86M
-rw-rw-r--. 1 hs hs  85M Jan 4 15:54 3095.dat.gz
-rw-rw-r--. 1 hs hs  107 Jan 4 15:54 3096.dat.gz
-rw-rw-r--. 1 hs hs 740K Jan 4 15:54 3097.dat.gz
-rw-rw-r--. 1 hs hs   39 Jan 4 15:54 3098.dat.gz
-rw-rw-r--. 1 hs hs 4.3K Jan 4 15:54 toc.dat

One advantage of the directory format is that you can use more than one core to perform the backup. In the case of a plain or custom format, only one process is used by pg_dump. The directory format changes that rule. The following example shows how you can tell pg_dump to use four cores (jobs):

[hs@linuxpc backup]$ rm -rf *
[hs@linuxpc backup]$ pg_dump -Fd test -f /tmp/backup/ -j 4

Note that the more objects you have in your database, the more potential speedup there will be.

To summarize, you learned about creating backups in general. If you've enjoyed reading this post, do check out 'Mastering PostgreSQL 10' to learn how to replay backups and handle global data in PostgreSQL 10. You will also learn how to use PostgreSQL's onboard tools to replicate instances.

PostgreSQL 11 Beta 1 is out!
How to perform data partitioning in PostgreSQL 10
How to implement Dynamic SQL in PostgreSQL 10


Learning the Salesforce Analytics Query Language (SAQL)

Amey Varangaonkar
09 Mar 2018
6 min read
Salesforce Einstein offers its own query language, called Salesforce Analytics Query Language (SAQL), to retrieve your data from various sources. The lenses and dashboards in Einstein use SAQL behind the scenes to manipulate data for meaningful visualizations. In this article, we see how to use Salesforce Analytics Query Language effectively.

Using SAQL
There are three ways to use SAQL in Einstein Analytics:
- Creating steps/lenses: We can use SAQL while creating a lens or step. It is the easiest way of using SAQL. While creating a step, Einstein Analytics provides the flexibility of switching between modes such as Chart Mode, Table Mode, and SAQL Mode. In this article, we will use this method.
- Analytics REST API: Using this API, the user can access datasets, lenses, dashboards, and so on. This is a programmatic approach in which you send queries to the Einstein Analytics platform. Einstein Analytics uses the OAuth 2.0 protocol to securely access the platform data. The OAuth protocol is a way of securely authenticating a user without asking them for credentials. The first step in using the Analytics REST API is therefore to authenticate the user using OAuth 2.0.
- Using Dashboard JSON: We can use SAQL while editing the dashboard JSON, which we have already seen in previous chapters. To access the dashboard JSON, open the dashboard in edit mode and press Ctrl + E.

The simplest way of using SAQL is while creating a step or lens, where a user can switch between the modes. To use SAQL for a lens, perform the following steps:
1. Navigate to Analytics Studio | DATASETS and select any dataset. We are going to select Opportunity here. Click on it and it will open a window to create a lens.
2. Switch to SAQL Mode by clicking on the icon in the top-right corner, as shown in the following screenshot.

In SAQL, a query is made up of multiple statements. The first statement loads the input data from the dataset; later statements operate on it and finally return the result. The user can use the Run Query button to see the results and errors after changing or adding statements. Errors appear at the bottom of the query editor.

SAQL is made up of statements that take an input dataset, and we build our logic on that. We can add filters, groups, orders, and so on, to this dataset to get the desired output. There are certain ordering rules that need to be followed while creating these statements:
- There can be only one offset in the foreach statement
- The limit statement must come after offset
- The offset statement must come after filter and order
- The order and filter statements can be swapped, as there is no rule governing their relative position

In SAQL, we can perform all the usual mathematical calculations and comparisons. SAQL also supports arithmetic operators, comparison operators, string operators, and logical operators.

Using foreach in SAQL
The foreach statement applies a set of expressions to every row, which is called projection. The foreach statement is mandatory to get the output of the query. The following is the syntax for the foreach statement:

q = foreach q generate expression as 'expression name';

Let's look at one example of using the foreach statement:
1. Go to Analytics Studio | DATASETS and select any dataset. We are going to select Opportunity here. Click on it and it will open a window to create a lens.
2. Switch to SAQL Mode by clicking on the icon in the top-right corner.
3. In the query editor you will see the following code:

q = load "opportunity";
q = group q by all;
q = foreach q generate count() as 'count';
q = limit q 2000;

You can see the result of this query just below the query editor.
4. Now replace the third statement with the following statement:

q = foreach q generate sum('Amount') as 'Sum Amount';

5. Click on the Run Query button and observe the result as shown in the following screenshot.

Using grouping in SAQL
The user can group records with the same value into one group by using the group statement. Use the following syntax:

q = group rows by fieldName

Let's see how to use grouping in SAQL by performing the following steps:
1. Replace the second and third statements with the following statements:

q = group q by 'StageName';
q = foreach q generate 'StageName' as 'StageName', sum('Amount') as 'Sum Amount';

2. Click on the Run Query button and you should see the result.

Using filters in SAQL
Filters in SAQL behave just like a where clause in SOQL and SQL, filtering the data as per the condition or clause. In Einstein Analytics, a filter selects the rows from the dataset that satisfy the condition added. The syntax for a filter is as follows:

q = filter q by fieldName 'Operator' value

Click on Run Query and view the result as shown in the following screenshot.

Using functions in SAQL
The beauty of a function is its reusability: once a function is created, it can be used multiple times. In SAQL, we can use different types of functions, such as string functions, math functions, aggregate functions, windowing functions, and so on. These functions are predefined and can be reused as often as needed. Let's use the math function power. Its syntax is power(m, n), and it returns the value of m raised to the nth power. Replace the fourth statement with the following statement:

q = foreach q generate 'StageName' as 'StageName', power(sum('Amount'), 1/2) as 'Amount Squareroot', sum('Amount') as 'Sum Amount';

Click on the Run Query button. We saw how to apply different kinds of case-specific functions in Salesforce Einstein to play with data in order to get the desired outcome.

Note: The above excerpt is taken from the book Learning Einstein Analytics, written by Santosh Chitalkar. It covers techniques to set up and create apps, lenses, and dashboards using Salesforce Einstein Analytics for effective business insights. If you want to know more about these techniques, check out the book Learning Einstein Analytics.


Object Detection Using Image Features in JavaScript

Packt
05 Oct 2015
16 min read
In this article by Foat Akhmadeev, author of the book Computer Vision for the Web, we will discuss how we can detect an object on an image using several JavaScript libraries. In particular, we will see techniques such as FAST feature detection, and BRIEF and ORB descriptor matching. Eventually, an object detection example will be presented.

There are many ways to detect an object on an image. Color object detection, which detects changes in the intensity of an image, is just one of the simpler computer vision methods, and it belongs to the sort of fundamentals every computer vision enthusiast should know. The libraries we use here are:
- JSFeat (http://inspirit.github.io/jsfeat/)
- tracking.js (http://trackingjs.com)

Detecting key points
What information do we get when we see an object on an image? An object usually consists of some regular parts or unique points, which represent this particular object. Of course, we could compare each pixel of an image, but that is not a good idea in terms of computational speed. We could instead take unique points randomly, reducing the computation cost significantly, but we would still not get much information from random points. Using all the information, we get too much noise and lose important parts of the object representation. Eventually, we have to admit that both ideas, taking all pixels and selecting random pixels, are really bad. So, what can we do?

We are working with a grayscale image and we need to get unique points of the image. We therefore focus on the intensity information, for example by getting object edges with the Canny edge detector or the Sobel filter. We are closer to the solution! But still not close enough. What if we have a long edge? Don't you think it is a bit bad to have so many unique pixels lying on that edge? An edge of an object has end points or corners; if we reduce our edge to those corners, we will get enough unique pixels and remove unnecessary information.

There are various methods of getting keypoints from an image, many of which extract corners as those keypoints. To get them, we will use the FAST (Features from Accelerated Segment Test) algorithm. It is really simple and you can easily implement it yourself if you want. But you do not need to: the algorithm implementation is provided by both the tracking.js and JSFeat libraries. The idea of the FAST algorithm can be captured from the following image:

Suppose we want to check whether the pixel P is a corner. We check 16 pixels around it. If at least 9 pixels in an arc around P are much darker or brighter than the value of P, then we say that P is a corner. How much darker or brighter should they be? The decision is made by applying a threshold to the difference between the value of P and the value of the pixels around it.

A practical example
First, we will start with an example of FAST corner detection for the tracking.js library. Before we do anything, we can set the detector threshold. The threshold defines the minimum difference between a tested corner and the points around it:

tracking.Fast.THRESHOLD = 30;

It is usually a good practice to apply a Gaussian blur on an image before we start the method.
It significantly reduces the noise of an image:

var imageData = context.getImageData(0, 0, cols, rows);
var gray = tracking.Image.grayscale(imageData.data, cols, rows, true);
var blurred4 = tracking.Image.blur(gray, cols, rows, 3);

Remember that the blur function returns a 4-channel array (RGBA). In that case, we need to convert it to 1 channel. Since we can easily skip the other channels, it should not be a problem:

var blurred1 = new Array(blurred4.length / 4);
for (var i = 0, j = 0; i < blurred4.length; i += 4, ++j) {
    blurred1[j] = blurred4[i];
}

Next, we run a corner detection function on our image array:

var corners = tracking.Fast.findCorners(blurred1, cols, rows);

The result is an array whose length is twice the number of corners, returned in the format [x0,y0,x1,y1,...], where [xn, yn] are the coordinates of a detected corner. To print the result on a canvas, we use the fillRect function:

for (i = 0; i < corners.length; i += 2) {
    context.fillStyle = '#0f0';
    context.fillRect(corners[i], corners[i + 1], 3, 3);
}

Let's see an example with the JSFeat library, for which the steps are very similar to those of tracking.js. First, we set the global threshold with a function:

jsfeat.fast_corners.set_threshold(30);

Then, we apply a Gaussian blur to an image matrix and run the corner detection:

jsfeat.imgproc.gaussian_blur(matGray, matBlurred, 3);

We need to preallocate keypoints for the corners result. The keypoint_t function is just a new type which is useful for keypoints of an image. The first two parameters represent the coordinates of a point, and the other parameters set the point score (which indicates whether the point is good enough to be a keypoint), the point level (which you can use in an image pyramid, for example), and the point angle (which is usually used for the gradient orientation):

var corners = [];
var i = cols * rows;
while (--i >= 0) {
    corners[i] = new jsfeat.keypoint_t(0, 0, 0, 0, -1);
}

After all this, we execute the FAST corner detection method. As the last parameter of the detection function, we define a border size. The border is used to constrain the circle around each possible corner: for example, you cannot precisely say whether the [0,0] pixel is a corner, since there is no [0,-3] pixel in our matrix:

var count = jsfeat.fast_corners.detect(matBlurred, corners, 3);

Since we preallocated the corners, the function returns the number of calculated corners for us. The result is an array of structures with x and y fields, so we can print it using those fields:

for (var i = 0; i < count; i++) {
    context.fillStyle = '#0f0';
    context.fillRect(corners[i].x, corners[i].y, 3, 3);
}

The result is nearly the same for both algorithms; the differences are in some parts of the implementation. Let's look at the following example, from left to right: tracking.js without blur, JSFeat without blur, tracking.js and JSFeat with blur.

If you look closely, you can see the difference between the tracking.js and JSFeat results, but it is not easy to spot. Look at how much noise was reduced by applying just a small 3 x 3 Gaussian filter! A lot of noisy points were removed from the background, and now the algorithm can focus on points that represent the flowers and the pattern of the vase.

We have extracted key points from our image, and we successfully reached the goal of reducing the number of keypoints and focusing on the unique points of the image. Now, we need to compare or match those points somehow. How can we do that?
Descriptors and object matching
Image features by themselves are a bit useless. Yes, we have found unique points on an image. But what did we get? Only values of pixels, and that's it. If we try to compare these values directly, it will not give us much information. Moreover, if we change the overall image brightness, we will not find the same keypoints on the same image! Taking all of this into account, we need the information that surrounds our key points, and we need a method to compare this information efficiently. First, we need to describe the image features, which is what image descriptors do. In this part, we will see how these descriptors can be extracted and matched. The tracking.js and JSFeat libraries provide different methods for image descriptors; we will discuss both.

BRIEF and ORB descriptors
Descriptor theory is focused on changes in image pixel intensities. The tracking.js library provides BRIEF (Binary Robust Independent Elementary Features) descriptors, and JSFeat provides their extension, ORB (Oriented FAST and Rotated BRIEF). As we can see from the ORB name, it is rotation invariant. This means that even if you rotate an object, the algorithm can still detect it. Moreover, the authors of the JSFeat library provide an example using an image pyramid, which is scale invariant too.

Let's start by explaining BRIEF, since it is the source for ORB descriptors. As a first step, the algorithm takes computed image features, and for each feature it takes unique pairs of elements around it. Based on these pairs' intensities it forms a binary string. For example, if we have a pair of positions i and j, and if I(i) < I(j) (where I(pos) means the image value at the position pos), then the result is 1, otherwise 0. We add this result to the binary string. We do that for N pairs, where N is taken as a power of 2 (128, 256, 512).

Since descriptors are just binary strings, we can compare them efficiently. To match these strings, the Hamming distance is usually used. It gives the minimum number of substitutions required to change one string into another. For example, take two binary strings: 10011 and 11001. The Hamming distance between them is 2, since we need to change 2 bits of information to turn the first string into the second.

The JSFeat library provides the functionality to apply ORB descriptors. The core idea is very similar to BRIEF. However, there are two major differences:
- The implementation is scale invariant, since the descriptors are computed for an image pyramid.
- The descriptors are rotation invariant; the direction is computed using the intensity of the patch around a feature. Using this orientation, ORB manages to compute the BRIEF descriptor in a rotation-invariant manner.

Implementation of descriptors and their matching
Our goal is to find an object from a template on a scene image. We can do that by finding features and descriptors on both images and matching the descriptors from the template to those of the scene. We start with the tracking.js library and BRIEF descriptors. The first thing we can do is set the number of location pairs:

tracking.Brief.N = 512

By default, it is 256, but you can choose a higher value. The larger the value, the more information you will get, and the more memory and computational cost it requires. Before starting the computation, do not forget to apply a Gaussian blur to reduce the image noise. Next, we find the FAST corners and compute descriptors on both images.
Here and in the next example, we use the suffix Object for the template image and Scene for the scene image:

var cornersObject = tracking.Fast.findCorners(grayObject, colsObject, rowsObject);
var cornersScene = tracking.Fast.findCorners(grayScene, colsScene, rowsScene);
var descriptorsObject = tracking.Brief.getDescriptors(grayObject, colsObject, cornersObject);
var descriptorsScene = tracking.Brief.getDescriptors(grayScene, colsScene, cornersScene);

Then we do the matching:

var matches = tracking.Brief.reciprocalMatch(cornersObject, descriptorsObject, cornersScene, descriptorsScene);

We need to pass both the corners and the descriptors to the function, since it returns coordinate information as a result. Next, we print both images on one canvas. To draw the matches with this trick, we need to shift the scene keypoints by the width of the template image, since keypoint1 of a match returns a point on the template and keypoint2 returns a point on the scene image. keypoint1 and keypoint2 are arrays with the x and y coordinates at indexes 0 and 1, respectively:

for (var i = 0; i < matches.length; i++) {
    var color = '#' + Math.floor(Math.random() * 16777215).toString(16);
    context.fillStyle = color;
    context.strokeStyle = color;
    context.fillRect(matches[i].keypoint1[0], matches[i].keypoint1[1], 5, 5);
    context.fillRect(matches[i].keypoint2[0] + colsObject, matches[i].keypoint2[1], 5, 5);
    context.beginPath();
    context.moveTo(matches[i].keypoint1[0], matches[i].keypoint1[1]);
    context.lineTo(matches[i].keypoint2[0] + colsObject, matches[i].keypoint2[1]);
    context.stroke();
}

The JSFeat project provides most of the code for pyramids and scale-invariant features not in the library itself, but in the examples, which are available at https://github.com/inspirit/jsfeat/blob/gh-pages/sample_orb.html. We will not provide the full code here, because it requires too much space, but we will highlight the main topics. Let's start with functions that are included in the library. First, we need to preallocate the descriptors matrix, where 32 is the byte length of a descriptor and 500 is the maximum number of descriptors. Again, 32 is a power of two:

var descriptors = new jsfeat.matrix_t(32, 500, jsfeat.U8C1_t);

Then, we compute the ORB descriptors for each corner; we need to do that for both the template and the scene image:

jsfeat.orb.describe(matBlurred, corners, num_corners, descriptors);

The matching function uses global variables, which mainly define the input descriptors and the output matching:

function match_pattern()

The resulting match_t structure contains the following fields:
- screen_idx: the index of a scene descriptor
- pattern_lev: the index of a pyramid level
- pattern_idx: the index of a template descriptor

Since ORB works with the image pyramid, it returns corners and matches for each level:

var s_kp = screen_corners[m.screen_idx];
var p_kp = pattern_corners[m.pattern_lev][m.pattern_idx];

We can print each match as shown here. Again, we use a shift, since we computed descriptors on separate images but print the result on one canvas:

context.fillRect(p_kp.x, p_kp.y, 4, 4);
context.fillRect(s_kp.x + shift, s_kp.y, 4, 4);

Working with a perspective
Let's take a step away. Sometimes, an object you want to detect is affected by a perspective distortion. In that case, you may want to rectify the object plane, for example a building wall. Looks good, doesn't it? How do we do that?
Let's look at the code:

var imgRectified = new jsfeat.matrix_t(mat.cols, mat.rows, jsfeat.U8_t | jsfeat.C1_t);
var transform = new jsfeat.matrix_t(3, 3, jsfeat.F32_t | jsfeat.C1_t);

jsfeat.math.perspective_4point_transform(transform,
    0, 0, 0, 0,         // first pair: x1_src, y1_src, x1_dst, y1_dst
    640, 0, 640, 0,     // x2_src, y2_src, x2_dst, y2_dst and so on
    640, 480, 640, 480,
    0, 480, 180, 480);
jsfeat.matmath.invert_3x3(transform, transform);
jsfeat.imgproc.warp_perspective(mat, imgRectified, transform, 255);

First, as we did earlier, we define a result matrix object. Next, we assign a matrix for the image perspective transformation. We calculate it based on four pairs of corresponding points. For example, the last (fourth) point of the original image, [0, 480], should be projected to the point [180, 480] on the rectified image; the first coordinate refers to x and the second to y. Then, we invert the transform matrix so we can apply it to the original image, the mat variable. We pick the background color as white (255 for an unsigned byte). As a result, we get a nice image without any perspective distortion.

Finding an object location
Returning to our primary goal: we found a match, which is great, but we have not yet found the object's location. There is no function for that in the tracking.js library, but JSFeat provides such functionality in its examples section. First, we need to compute a perspective transform matrix; that is why we discussed such a transformation above. We have points from the two images, but we do not have a transformation for the whole image. We start by defining a transform matrix:

var homo3x3 = new jsfeat.matrix_t(3, 3, jsfeat.F32C1_t);

To compute the homography, we need only four points. But after the matching, we get far more than that, and in addition there can be noisy points, which we need to skip somehow. For that, we use the RANSAC (Random Sample Consensus) algorithm. It is an iterative method for estimating a mathematical model from a dataset that contains outliers (noise). It identifies the outliers and estimates a model that is computed without the noisy data. Before we start, we need to define the algorithm parameters. The first is a match mask, where all matches will be marked as good (1) or bad (0):

var match_mask = new jsfeat.matrix_t(500, 1, jsfeat.U8C1_t);

Our mathematical model to find:

var mm_kernel = new jsfeat.motion_model.homography2d();

The minimum number of points to estimate a model (4 points to get a homography):

var num_model_points = 4;

The maximum threshold to classify a data point as an inlier, or a good match:

var reproj_threshold = 3;

Finally, the variable that holds the main parameters; the last two arguments define the maximum ratio of outliers and the probability of success, that is, the algorithm stops when 99 percent of the points are inliers:

var ransac_param = new jsfeat.ransac_params_t(num_model_points, reproj_threshold, 0.5, 0.99);

Then, we run the RANSAC algorithm. The last parameter represents the maximum number of iterations for the algorithm:

jsfeat.motion_estimator.ransac(ransac_param, mm_kernel, object_xy, screen_xy, count, homo3x3, match_mask, 1000);

The shape finding can be applied with both the tracking.js and JSFeat libraries; you just need to set the matches as object_xy and screen_xy, where those arguments must hold an array of objects with x and y fields.
After we find the transformation matrix, we compute the projected shape of the object on the new image:

var shape_pts = tCorners(homo3x3.data, colsObject, rowsObject);

After the computation is done, we draw the computed shapes on our images. As we see, our program successfully found the object in both cases. Actually, the two methods can show different performance, which mainly depends on the thresholds you set.

Summary
Image features and descriptor matching are powerful tools for object detection. Both the JSFeat and tracking.js libraries provide different functionalities to match objects using these features. In addition, the JSFeat project contains algorithms for object finding. These methods can be useful for tasks such as unique object detection, face tracking, and creating a human interface by tracking various objects very efficiently.

Resources for Article:
Further resources on this subject:
- Welcome to JavaScript in the full stack [article]
- Introducing JAX-RS API [article]
- Using Google Maps APIs with Knockout.js [article]


5 reasons government should regulate technology

Richard Gall
17 Jul 2018
6 min read
Microsoft's Brad Smith took the unprecedented step last week of calling for government to regulate facial recognition technology. In an industry that has resisted government intervention, it was a bold yet humble move. It was a way of saying "we can't deal with this on our own." There will certainly be people who disagree with Brad Smith. For some, the entrepreneurial spirit that is central to tech and startup culture will only be stifled by regulation. But let's be realistic about where we are at the moment: the technology industry has never faced such a crisis of confidence or been met with such public cynicism. Perhaps government regulation is precisely what we need to move forward. Here are five reasons why government should regulate technology.

Regulation can restore accountability and rebuild trust in tech
We've said it a lot in 2018, but there really is a significant trust deficit in technology at the moment. From the Cambridge Analytica scandal to AI bias, software has been making headlines in a way it never has before. This only cultivates a culture of cynicism across the public. And with talk of automation and job losses, it paints a dark picture of the future. It's no wonder that TV series like Black Mirror have such a hold over the public imagination.

Of course, when used properly, technology should simply help solve problems, whether that's better consumer tech or improved diagnoses in healthcare. The problem arises when we find that our problem-solving innovations have unintended consequences. By regulating, government can begin to think through some of these unintended consequences. But more importantly, trust can only be rebuilt once there is some degree of accountability within the industry. Think back to Zuckerberg's Congressional hearing earlier this year: while the Facebook chief may have been sweating, the real takeaway was that his power and influence were ultimately untouchable. Whatever mistakes he'd made were just part and parcel of moving fast and breaking things. An apology and a humble shrug might normally pass, but with regulation, things begin to get serious. Misusing user data? We've got a law for that. Potentially earning money from people who want to undermine western democracy? We've got a law for that.

Read next: Is Facebook planning to spy on you through your mobile's microphones?

Government regulation will make the conversation around the uses and abuses of technology more public
Too much conversation about how and why we build technology is happening in the wrong places. Well, not the wrong places, just not enough places. The biggest decisions about technology are largely made by some of the biggest companies on the planet. All the dreams about a new democratized and open world are all but gone, as the innovations around which we build our lives come from a handful of organizations that have both financial and cultural clout. As Brad Smith argues, tech companies like Microsoft, Google, and Amazon are not the place to be having conversations about the ethical implications of certain technologies. He argues that while it's important for private companies to take more responsibility, it's an "inadequate substitute for decision making by the public and its representatives in a democratic republic." He notes that commercial dynamics are always going to twist conversations. Companies, after all, are answerable to shareholders; only governments are accountable to the public.
By regulating, the decisions we make (or don't make) about technology immediately enter into public discourse about the kind of societies we want to live in.

Citizens can be better protected by tech regulation...
At present, technology often advances in spite of, not because of, people. For all the talk of human-centered design and putting the customer first, every company that builds software is interested in one thing: making money. AI in particular can be dangerous for citizens. For example, according to a ProPublica investigation, AI has been used to predict future crimes in the justice system. That's frightening in itself, of course, but it's particularly terrifying when you consider that criminality was falsely predicted twice as often for black people as for white people. Even social media filters, in which machine learning serves content based on a user's behavior and profile, present dangers to citizens. They give rise to fake news and dubious political campaigning, making citizens more vulnerable to extreme, and false, ideas. By properly regulating this technology we should immediately have more transparency over how these systems work. This transparency would not only lead to more accountability in how they are built, it would also ensure that changes can be made when necessary.

Read next: A quick look at E.U.'s pending antitrust case against Google's Android

...Software engineers need protection too
One group hasn't really been talked about when it comes to government regulation: the people actually building the software. This is a big problem. If we're talking about the ethics of AI, the software engineers building it are left in a vulnerable position, because the lines of accountability are blurred. Without a government framework that supports ethical software decision making, engineers are left in limbo. With more support from government, software engineers can be more confident in challenging decisions from their employers. We need to have a debate about who's responsible for the ethics of code that's written into applications today. Is it the engineer? The product manager? Or the organization itself? That isn't going to be easy to answer, but some government regulation or guidance would be a good place to begin.

Regulation can bridge the gap between entrepreneurs, engineers and lawmakers
Times change. Years ago, technology was deployed by lawmakers as a means of control, production or exploration. That's why the military was involved with many of the innovations of the mid-twentieth century. Today, the gap couldn't be bigger. Lawmakers barely understand encryption, let alone how algorithms work. But there is naivety in the business world too. With a little more political nous and even critical thinking, perhaps Mark Zuckerberg could have predicted the Cambridge Analytica scandal. Maybe Elon Musk would be a little more humble in the face of a coordinated rescue mission. There's clearly a problem: on the one hand, some people don't know what's already possible; for others, it's impossible to consider that something that is possible could have unintended consequences. By regulating technology, everyone will have to get to know one another. Government will need to delve deeper into the field, and entrepreneurs and engineers will need to learn more about how regulation may affect them. To some extent, this will have to be the first thing we do: develop a shared language. It might also be the hardest thing to do.


OK Google, why are you ok with mut(at)ing your ethos for Project DragonFly?

Aarthi Kumaraswamy
16 Oct 2018
8 min read
Wired has managed to do what Congress couldn't: bring together tech industry leaders in the US and ask the pressing questions of our times, in a safe and welcoming space. Just for this, they deserve applause. Yesterday at the Wired 25 summit, Sundar Pichai, Google's CEO, among other things, opened up to Backchannel's editor in chief, Steven Levy, about Project Dragonfly for the first time in public. Project Dragonfly is the secretive search engine that Google is allegedly developing to comply with Chinese censorship rules. The following is my analysis of why Google is deeply invested in Project Dragonfly.

Google's mission since its inception has been to organize the world's information and to make it universally accessible, as Steven puts it. When asked if this has changed in 2018, Pichai responded that Google's mission remains the same, and so do its founding values. What has changed is the scale of its operations, its user base, and its product portfolio. In effect, this means the company now views everything it does through a wider lens instead of thinking only about its users.

https://www.facebook.com/wired/videos/vb.19440638720/178516206400033/?type=2&theater

For Google, China is an untapped source of information
"We are compelled by our mission [to] provide information to everyone, and [China is] 20 percent of the world's population", said Pichai. He believes China is a highly innovative and underserved market that is too big to be ignored. For this reason, according to Pichai at least, Google is obliged to take a long-term view on the subject. But there are a number of specific reasons that make China compelling to Google right now. China is a huge social experiment at scale, with wide-scale surveillance and monitoring; in other words, data. But with the Chinese government keen to tightly control information about the country and its citizens, it is not necessarily well understood by businesses from outside the country. This means moving into China could be an opportunity for Google to gain a real competitive advantage in a number of different ways. Pichai confirmed that internal tests show Google can serve well over 99 percent of search queries from users in China. This means they probably have a good working product prototype to launch soon, should a window of opportunity arise. These lessons can then directly inform Google's decisions about what to do next in China.

What can Google do with all that exclusive knowledge?
Pichai wrote earlier last week to some Senate members who wanted answers on Project Dragonfly that Google could have "broad benefits inside and outside of China." He did not go into detail, but these benefits are clear. Google would gain insight into a huge country that tightly controls information about itself and its citizens.

Helping Google to expand into new markets
By extension, this would bring a number of huge commercial advantages when it comes to China. It would place Google in a fantastic position to make China another huge revenue stream. Secondly, the data harvested in the process could provide a massive and critical boost to Google's AI research, products, and tooling ecosystems that others like Facebook don't have access to. The less obvious but possibly even bigger benefits for Google are the wider applications of its insights. These will be particularly useful as it seeks to make inroads into other rapidly expanding markets such as India, Brazil, and the African subcontinent.
Helping Google to consolidate its strength in western nations
As well as helping Google expand, it's also worth noting that Google's Chinese venture could support the company as it seeks to consolidate and reassert itself in the west. Here, markets are not growing quickly, but Google could do more to advance its position within these areas using what it learns from business and product innovations in China.

The caveat: Moral ambivalence is a slippery slope
Let's not forget that the first step into moral ambiguity is always the hardest. Once Google enters China, the route into murky and morally ambiguous waters actually gets easier. Arguably, this move could change the shape of Google as we know it. Even if the company does not care whether it makes a commercial impact, the wider implications for how tech companies operate across the planet could be huge.

How is Google rationalizing the decision to re-enter China?
Letting a billion flowers bloom and wither to grow a global forest seems to be at the heart of Google's decision to deliberately pursue China's market. The following are some of the ways Google has been justifying its decision.

We never left China
When asked why Google has decided to go back to China after exiting the market in 2010, Pichai clarified that Google never left China; it only stopped providing search services there. Android, for example, has become one of the most popular mobile OSes in China over the years. He might as well have said "I already have a leg in the quicksand, might as well dip the other one." Instead of assessing the reasons to stay in China through the lens of its AI principles, Google is jumping into the state censorship agenda.

Being legally right is morally right
"Any time we are working in countries around the world, people don't understand fully, but you're always balancing a set of values... Those values include providing access to information, freedom of expression, and user privacy... But we also follow the rule of law in every country," said Pichai in the Wired 25 interview. This seems to imply that Google sees legal compliance as analogous to ethical practice. While the AI principles at Google should have guided it in precisely this kind of situation, they have been reduced to an oversimplified "don't create killer AI" tenet. Just this Tuesday, China passed a law that is explicit about how it intends to use technology to implement extreme measures to suppress free expression and violate human rights. Google is choosing to turn a blind eye to how its technology could be used to indirectly achieve such nefarious outcomes in an efficient manner.

We aren't the only ones doing business in China
Another popular line of reasoning, though not one voiced by Google, is that it is unfair to single out Google and ask it not to do business in China when others like Apple have been benefiting from such a relationship for years. Just because everyone is doing something, it does not make it intrinsically right. For a company known for challenging the status quo and standing by its values, this marks the day when Google lost its credentials to talk about doing the right thing.

Time and tech wait for none; if we don't participate, we will be left behind
Pichai said, "Technology ends up progressing whether we want it to or not. I feel on every important technology it is important that you work aggressively to make sure the outcome is good." Now that is a typical engineering response to a socio-philosophical problem.
It reeks of the hubris that most tech executives in Silicon Valley wear as a badge of honor.

We're making information universally accessible and thus enriching lives
Pichai observed that in China there are many areas, such as cancer treatment options, where Google can provide better and more authentic information than the products and services currently available. I don't know about you, but when an argument leans on cancer to win its case, I typically disregard it.

All things considered, in the race for AI domination, China's data is the holy grail. An invitation to watch and learn from close quarters is an offer too good to refuse, even for Google. Even as current and former employees, human rights advocacy organizations, and Senate members continue to voice their dissent strongly, Google is sending a clear message that it isn't going to back down on Project Dragonfly. The only way to stop this downward moral spiral at this point appears to be us, the current Google users, as the last line of defense to protect human rights, freedom of speech, and other democratic values. That gives me a sinking feeling as I type this post in Google Docs, using Chrome and Google Search to gather information just as I have been doing for years now. Are we doomed to a dystopian future, locked in by tech giants that put growth over stability, viral ads over community, and censorship and propaganda over truth and free speech? Welcome to 1984.
article-image-how-far-will-facebook-go-to-fix-what-it-broke-democracy-trust-reality
Aarthi Kumaraswamy
24 Sep 2018
19 min read
Save for later

How far will Facebook go to fix what it broke: Democracy, Trust, Reality

Aarthi Kumaraswamy
24 Sep 2018
19 min read
Facebook, along with other tech media giants, like Twitter and Google, broke the democratic process in 2016. Facebook also broke the trust of many of its users as scandal after scandal kept surfacing telling the same story in different ways - the story of user data and trust abused in exchange for growth and revenue. The week before last, Mark Zuckerberg posted a long explanation on Facebook titled ‘Preparing for Elections’. It is the first of a series of reflections by Zuckerberg that ‘address the most important issues facing Facebook’. That post explored what Facebook is doing to avoid ending up in a situation similar to the 2016 elections when the platform ‘inadvertently’ became a super-effective channel for election interference of various kinds. It follows just weeks after Facebook COO, Sheryl Sandberg appeared in front of a Senate Intelligence hearing alongside Twitter CEO, Jack Dorsey on the topic of social media’s role in election interference. Zuckerberg’s mobile-first rigor oversimplifies the issues Zuckerberg opened his post with a strong commitment to addressing the issues plaguing Facebook using the highest levels of rigor the company has known in its history. He wrote, “I am bringing the same focus and rigor to addressing these issues that I've brought to previous product challenges like shifting our services to mobile.”  To understand the weight of this statement we must go back to how Facebook became a mobile-first company that beat investor expectations wildly. Suffice to say it went through painful years of restructuring and reorientation in the process. Those unfamiliar with that phase of Facebook, please read the section ‘How far did Facebook go to become a mobile-first company?’ at the end of this post for more details. To be fair, Zuckerberg does acknowledge that pivoting to mobile was a lot easier than what it will take to tackle the current set of challenges. He writes, “These issues are even harder because people don't agree on what a good outcome looks like, or what tradeoffs are acceptable to make. When it comes to free expression, thoughtful people come to different conclusions about the right balances. When it comes to implementing a solution, certainly some investors disagree with my approach to invest so much on security. We have a lot of work ahead, but I am confident we will end this year with much more sophisticated approaches than we began, and that the focus and investments we've put in will be better for our community and the world over the long term.” However, what Zuckerberg does not acknowledge in the above statement is that the current set of issues is not merely a product challenge, but a business ethics and sustainability challenge. Unless ‘an honest look in the mirror’ kind of analysis is done on that side of Facebook, any level of product improvements will only result in cosmetic changes that will end in an ‘operation successful, patient dead’ scenario. In the coming sections, I attempt to dissect Zuckerberg’s post in the context of the above points by reading between the lines to see how serious the platform really is about changing its ways to ‘be better for our community and the world over the long term’. Why does Facebook’s commitment to change feel hollow? Let’s focus on election interference in this analysis as Zuckerberg limits his views to this topic in his post. Facebook has been at the center of this story on many levels. Here is some context on where Zuckerberg is coming from.   
Facebook's involvement in the 2016 election meddling Apart from the traditional cyber-attacks (which they had, even back then, managed to prevent successfully), there were Russia-backed coordinated misinformation campaigns found on the platform. Then there was also the misuse of its user data by the data analytics firm Cambridge Analytica, which consulted on election campaigning. They micro-profiled users based on their psychographics (the way they think and behave) to ensure more effective ad spending by political parties. There was also the issue of certain kinds of ads, subliminal messages and peer pressure sent out to specific Facebook users during elections to prompt them to vote for certain candidates, while others did not receive similar messages. There were also alleged reports of a certain set of users having been sent 'dark posts' (posts that aren't publicly visible to all, but visible only to those on the target list) to discourage them from voting altogether. It also appears that Facebook staff offered to assist both the Clinton and the Trump campaigns with Facebook advertising. The former declined the offer while the latter accepted. We don't know which of these decisions and actions impacted the outcome of the 2016 US presidential elections, or to what extent. But one thing is certain: collectively, they had a significant enough impact for Zuckerberg and team to acknowledge that these are serious problems they need to address, NOW! Deconstructing Zuckerberg's 'Protecting Elections' Before diving into what is problematic about the measures taken (or not taken) by Facebook, I must commend them for taking ownership of their role in election interference in the past and for attempting to rectify the wrongs. I like that Zuckerberg has made himself vulnerable by sharing his corrective plans with the public while they are a work in progress, and that he is engaging with the public at a personal level. Facebook's openness to academic research using anonymized Facebook data and its willingness to permit publishing findings without Facebook's approval are also noteworthy. Other initiatives such as the political ad transparency report, the AI-enabled fake account and fake news reduction strategy, doubling the content moderator base, and improving their recommendation algorithms are all steps in the right direction. However, this is where my list of nice things to say ends. The overall tone of Zuckerberg's post is that of bargaining rather than that of acceptance. Interestingly, this was exactly the tone adopted by Sandberg as well in the Senate hearing earlier this month, down to some very similar phrases. This makes one question whether everything isn't just one well-orchestrated PR disaster management plan. Disappointingly, most of the actions stated in Zuckerberg's post feel like half-measures; I get the sense that they aren't willing to go the full distance to achieve the objectives they set for themselves. I hope to be wrong. 1. Zuckerberg focuses too much on 'what' and 'how', and is ignoring the 'why' Zuckerberg identifies three key issues he wants to address in 2018: preventing election interference, protecting the community from abuse, and providing users with better control over their information. This clarity is a good starting point. In this post, he only focuses on the first issue, so I will reserve sharing my detailed thoughts on the other two for now.
What I would say for now is that the key to addressing all issues on Facebook is taking a hard look at Facebook policies, including privacy, from a mission statement perspective. In other words, be honest about 'Why Facebook exists'. Users are annoyed, advertisers are not satisfied, and neither are shareholders confident about Facebook's future. Trying to be everyone's friend is clearly not working for Facebook. As such, I expected this in the opening part of the series. 'Be better for our community and the world over the long term' is too vague a mission statement to be of any practical use. 2. Political Ad transparency report is necessary, but not sufficient In May this year, Facebook released its first political ad transparency report as a gesture to show its commitment to minimizing political interference. The report allows one to see who sponsored which issue advertisement and for how much. This was a move welcomed by everyone, and soon others like Twitter and Google followed suit. By doing this, Facebook hopes to allow its users to form more informed views about political causes and other issues. Here is my problem with this feature. (Yes, I do view this report as a 'feature' of the new Facebook app which serves a very specific need: to satisfy regulators and media.) The average Facebook user is not a politically or technologically savvy consumer. They use Facebook to connect with friends and family and maybe play silly games now and then. The majority of these users aren't going to proactively check out this ad transparency report or the political ad database to arrive at the right conclusions. The people who will find this report interesting are academic researchers, campaign managers, and analysts. It is one more rich data point for understanding campaign strategy and thereby inferring who the target audience is. This could most likely lead to a downward spiral of more and more polarizing ads from parties across the spectrum. 3. How election campaigning, hate speech, and real violence are linked but unacknowledged Another issue closely tied to political ads is hate speech and violence-inciting, polarising content that isn't necessarily paid advertising. This is typically content in the form of posts, images or videos posted in response to political ads or discourses. These act as carriers that amplify the political message, often in ways unintended by the campaigners themselves. The echo chambers still exist. And the more one's ecosystem or 'look-alike audience' responds to certain types of ads or posts, the more likely users are to keep seeing them, thanks to Facebook's algorithms. Seeing something that is endorsed by one's friends often primes one to trust what is said without verifying the facts for oneself, thus enabling fake news to go viral. The algorithm does the rest to ensure everyone who will engage with the content sees it. Newsy political ads will thrive in such a setup while getting away with saying 'we made full disclosure in our report'. All of this is great for Facebook's platform, as it not only gets great engagement from the content but also increased ad spending from all political parties, as they can't afford to be missing from action on Facebook. A by-product of this ultra-polarised scenario, though, is more protectionism and less free, open and meaningful dialog and debate between candidates as well as supporters on the platform. That's bad news for the democratic process. 4.
Facebook's election interference prevention model is not scalable Their single-minded focus on eliminating US election interference on Facebook's platforms through a multipronged approach to content moderation is worth appreciating. This also makes one optimistic about Facebook's role in consciously attempting to do the right thing when it comes to respecting election processes in other nations as well. But the current approach of creating an 'election war room' is neither scalable nor sustainable. What happens every time a constituency in the US, or some other part of the world, holds an election? What happens when multiple elections take place across the world simultaneously? Whom does Facebook prioritize for election interference defense support, and why? Also, I wouldn't go so far as to trust that they will uphold individual liberties in troubled nations with strong regimes or strongly divisive political discourses. What happens when the ruling party is the one interfering with the elections? Who is Facebook answerable to? 5. Facebook's headcount hasn't kept up with its own growth ambitions Zuckerberg proudly states in his post that they've deleted a billion fake accounts with machine learning and have doubled the number of people hired to work on safety and security. "With advances in machine learning, we have now built systems that block millions of fake accounts every day. In total, we removed more than one billion fake accounts -- the vast majority within minutes of being created and before they could do any harm -- in the six months between October and March. ....it is still very difficult to identify the most sophisticated actors who build their networks manually one fake account at a time. This is why we've also hired a lot more people to work on safety and security -- up from 10,000 last year to more than 20,000 people this year." 'People working on safety and security' could have a wide range of job responsibilities, from network security engineers to security guards hired at Facebook offices. What is conspicuously missing in the above picture is a breakdown of the number of people hired specifically to fact-check, moderate content, resolve policy-related disputes and review flagged content. With billions of users posting on Facebook, the job of content moderators and policy enforcers, even when assisted by algorithms, is massive. It is important that they are rightly incentivized to do their job well and are set clear and measurable goals. The post talks neither of how Facebook plans to reward moderators nor of what the yardsticks for performance in this area would be. Facebook fails to acknowledge that it is not fully prepared, partly because it is understaffed. 6. The new Product Policy Director, human rights role is a glorified Public Relations job The weekend following Zuckerberg's post, a new job opening appeared on Facebook's careers page for the position of 'Product policy director, human rights'. The snippet below is taken from that job posting. Source: Facebook careers The above is typically what a Public Relations head does as well. Not only are the responsibilities cited above heavily based on communication and public perception building, but there is also not much authority given to this role to influence how other teams achieve their goals. Simply put, this role 'works with, coordinates or advises teams'; it does not 'guide or direct teams'.
Also, another key point to observe is that this role aims to add another layer of distance to further minimize exposure for Zuckerberg, Sandberg and other top executives in public forums such as congressional hearings or press meets. Any role or area that is important to a business typically finds a place at the C-suite table. Had this new role been a C-suite role, it would have been advertised as such, and it may have had some teeth. Of the 24 key executives at Facebook, only one is concerned with privacy and policy, the 'Chief Privacy Officer & VP of U.S. Public Policy'. Even this role does not have a global directive or public welfare in mind. On the other hand, there are multiple product development, creative and business development roles in Facebook's C-suite. There is even a separate Watch product head, a messaging product head, and one just dedicated to China called 'Head of Creative Shop - Greater China'. This is why Facebook's plan to protect elections will fail I am afraid Facebook's greatest strength is also its Achilles heel. The tech industry's deified hacker culture is embodied perfectly by Facebook. Facebook's flawed, ad-revenue-based business model is the ingenious creation of that very hacker culture. Any attempt to correct everything else is futile without correcting the issues with the current model. The ad-revenue-based model is why the Facebook app is designed the way it is: with 'relevant' news feeds, filter bubbles and look-alike audience segmentation. It is the reason why viral content gets rewarded irrespective of its authenticity or the impact it has on society. It is also the reason why Facebook has a 'move fast and break things' internal culture where growth at all costs is favored and idolized. Facebook's Q2 2018 earnings summary highlights the above points succinctly. Source: Facebook's SEC Filing The above snapshot means that even if we assume all 30k-odd employees do some form of content moderation (the probability of which is zero), every employee is responsible for 50k users' content daily. Let's say every user posts only 1 post a day. If we assume Facebook's news feed algorithms are super efficient and only find 2% of the user content questionable or fake (as speculated by Sandberg in her Senate hearing this month), that would still mean nearly 1k posts per person to review every day! What can Facebook do to turn over a new leaf? Unless Facebook attempts to sincerely address at least some of the below, I will continue to be skeptical of any number of beautifully written posts by Zuckerberg or patriotically orated speeches by Sandberg. A content moderation transparency report that shares not just the number of posts moderated and the number of people working to moderate content on Facebook, but also the nature of content moderated, the moderators' job satisfaction levels, their tenure, qualifications, career aspirations, their challenges, and how much Facebook is investing in people, processes and technology to make its platform safe and objective for everyone to engage with others. A general ad transparency report that not only lists advertisers on Facebook but also their spending and chosen ad filters, for the public and academia to review or analyze at any time. Taking responsibility for the real-world consequences of actions enabled by Facebook, like the recent gender and age discrimination in employment ads shown on Facebook. Really banning hate speech and fake viral content.
Bringing in a business/AI ethics head who is second only to Zuckerberg and equal to Sandberg's COO role. Exploring and experimenting with other alternative revenue channels to tackle the current ad-driven business model problem. Resolving the UI problem so that users can gain back control over their data, and making it easy to choose not to participate in Facebook's data experiments. This would mean a potential loss of some ad revenue. Fixing the 'growth hacker' culture problem that is a byproduct of years of moving fast and breaking things. This would mean a significant change in behavior by everyone, starting from the top, and probably restructuring the way teams are organized and business is done. It would also mean a different definition and measurement of success, which could lead to shareholder backlash. But Mark is uniquely placed to withstand these pressures given his clout over the board's voting powers. Like his role model Augustus Caesar, Zuckerberg has a chance to make history. But he might have to put the company through hard times of sacrifice in exchange for the proverbial 200 years of world peace. He's got the best minds and limitless resources at his disposal to right what he and his platform wronged. But he would have to make enemies of the hands that feed him. Will he rise to the challenge? Like Augustus, who is rumored to have killed his grandson, will Zuckerberg ever be prepared to kill his ad-revenue-generating brainchild? In the meantime, we must not underestimate the power of good digital citizenry. We must continue to fight the good fight to move tech giants like Facebook in the right direction. Just as persistent trickling water droplets can erode mountains and create new pathways, so can our mindful actions as digital platform users prompt major tech reforms. It could be as bold as deleting one's Facebook account (I haven't been on the platform for years now, and I don't miss it at all). You could organize groups to create awareness on topics like digital privacy, fake news, and filter bubbles, or deliberately choose to engage with those whose views differ from yours to understand their perspective and thereby do your part in reversing algorithmically accentuated polarization. It could also be by selecting the right individuals to engage in informed dialog with tech conglomerates. Not every action needs to be hard, though. It could be as simple as customizing your default privacy settings, choosing to spend only a select amount of time on such platforms, or deciding to verify the authenticity and assess the toxicity of a post you wish to like, share or forward to your network. Addendum How far did Facebook go to become a mobile-first company? Following are some of the things Facebook did to become the largest mobile advertising platform in the world, surpassing Google by a huge margin. Clear purpose and reason for the change: "For one, there are more mobile users. Second, they're spending more time on it... third, we can have better advertising on mobile, make more money," said Zuckerberg at TechCrunch Disrupt back in 2012 on why they were becoming mobile first. In other words, there was a lot of growth and revenue potential in investing in this space. This was a simple and clear 'what's in it for me' incentive for everyone working to make the transition, as well as for stockholders and advertisers to place their trust in Zuckerberg's endeavors.
Setting company-wide accountability: "We realigned the company around, so everybody was responsible for mobile," said the then President of Business and Marketing Partnerships David Fischer to Fortune in 2013. Willing to sacrifice desktop for mobile: Facebook decided to make a bold gamble, risking its desktop users to grow its unproven mobile platform. Essentially, it was willing to bet its only cash cow on a dark horse that was dependent on so many other factors going right. Strict consequences for non-compliance: Back in the days of transitioning to a mobile-first company, Zuckerberg famously said to all his product teams that when they went in for reviews: "Come in with mobile. If you come in and try to show me a desktop product, I'm going to kick you out. You have to come in and show me a mobile product." Expanding resources and investing in reskilling: They grew from a team of 20 mobile engineers to having literally all engineers at Facebook undergo training courses on iOS and Android development. "We've completely changed the way we do product development. We've trained all our engineers to do mobile first," said Facebook's VP of corporate development, Vaughan Smith, to TechCrunch by the end of 2012. Realigning product design philosophy: They designed custom features for the mobile-first interface instead of trying to adapt features built for the web to mobile. In other words, they began with mobile as their default user interface. Local and global user behavior sensitization: Some of their engineering teams even did field visits to developing nations like the Philippines to see first-hand how mobile apps are being used there. Environmental considerations in app design: Facebook even had the foresight to consider scenarios where mobile users may not have quality internet signals or may face poor battery life. They designed their apps keeping these future needs in mind.

OpenCV 4.0 releases with experimental Vulkan, G-API module and QR-code detector among others

Natasha Mathur
21 Nov 2018
2 min read
Two months after the OpenCV team announced the alpha release of OpenCV 4.0, the final version 4.0 of OpenCV is here. OpenCV 4.0 was announced last week and is now available as a C++11 library that requires a C++11-compliant compiler. This new release introduces features such as a G-API module, a QR code detector, performance improvements, and DNN improvements, among others. OpenCV is an open source library of programming functions that is mainly aimed at real-time computer vision. OpenCV is cross-platform and free for use under the open-source BSD license. Let's have a look at what's new in OpenCV 4.0. New Features G-API: OpenCV 4.0 comes with a completely new module, opencv_gapi. G-API is an engine for very efficient image processing, based on lazy evaluation and on-the-fly construction of the processing graph. QR code detector and decoder: OpenCV 4.0 comprises a QR code detector and decoder that have been added to the opencv/objdetect module, along with a live sample. The decoder is currently built on top of the Quirc library. Kinect Fusion algorithm: The popular Kinect Fusion algorithm has been implemented, optimized for CPU and GPU (OpenCL), and integrated into the opencv_contrib/rgbd module. Kinect 2 support has also been updated in the opencv/videoio module to make the live samples work. DNN improvements Support has been added for the Mask R-CNN model. A new integrated ONNX parser has been added. Support has been added for popular networks such as the YOLO object detection network. There's been an improvement in the performance of the DNN module in OpenCV 4.0 when built with Intel DLDT support, by utilizing more layers from DLDT. OpenCV 4.0 comes with an experimental Vulkan backend for platforms where OpenCL is not available. Performance improvements In OpenCV 4.0, hundreds of basic kernels in OpenCV have been rewritten with the help of "wide universal intrinsics". Wide universal intrinsics map to SSE2, SSE4, AVX2, NEON or VSX intrinsics, depending on the target platform and the compile flags. This leads to better performance, even for already optimized functions. Support has been added for IPP 2019 using the IPPICV component upgrade. For more information, check out the official release notes. Image filtering techniques in OpenCV 3 ways to deploy a QT and OpenCV application OpenCV and Android: Making Your Apps See

Emmanuel Tsukerman on why a malware solution must include a machine learning component

Savia Lobo
30 Dec 2019
11 min read
Machine learning is indeed the tech of present times! Security is a growing concern for many organizations today, and machine learning is one of the solutions for dealing with it. ML can help cybersecurity systems analyze patterns and learn from them to help prevent similar attacks and respond to changing behavior. To know more about machine learning and its application in cybersecurity, we had a chat with Emmanuel Tsukerman, a Cybersecurity Data Scientist and the author of Machine Learning for Cybersecurity Cookbook. The book covers the use of modern AI to create powerful cybersecurity solutions for malware, pentesting, social engineering, data privacy, and intrusion detection. In 2017, Tsukerman's anti-ransomware product was listed in the Top 10 ransomware products of 2018 by PC Magazine. In his interview, Emmanuel talked about how ML algorithms help in solving problems related to cybersecurity, and also gave a brief tour through a few chapters of his book. He also touched upon the rise of deepfakes and malware classifiers. On using machine learning for cybersecurity Using machine learning in cybersecurity scenarios will enable systems to identify different types of attacks across security layers and also help decide on a correct plan of action (POA). Can you share some examples of the successful use of ML for cybersecurity you have seen recently? A recent and interesting development in cybersecurity is that the bad guys have started to catch up with technology; in particular, they have started utilizing deepfake tech to commit crimes; for example, they have used AI to imitate the voice of a CEO in order to defraud a company of $243,000. On the other hand, the use of ML in malware classifiers is rapidly becoming an industry standard, due to the incredible number of never-before-seen samples (over 15,000,000) that are generated each year. On staying updated with developments in technology to defend against attacks Machine learning technology is not only used by ethical humans, but also by cybercriminals who use ML for ML-based intrusions. How can organizations counter such scenarios and ensure the safety of confidential organizational/personal data? The main tools that organizations have at their disposal to defend against attacks are to stay current and to pentest. Staying current, of course, requires getting educated on the latest developments in technology and its applications. For example, it's important to know that hackers can now use AI-based voice imitation to impersonate anyone they would like. This knowledge should be propagated in the organization so that individuals aren't caught off-guard. The other way to improve one's security is by performing regular pen tests using the latest attack methodology, be it by attempting to avoid the organization's antivirus, sending phishing communications, or attempting to infiltrate the network. In all cases, it is important to utilize the most dangerous techniques, which are often ML-based. On how ML algorithms and GANs help in solving cybersecurity problems In your book, you have mentioned various algorithms such as clustering, gradient boosting, random forests, and XGBoost. How do these algorithms help in solving problems related to cybersecurity? Unless a machine learning model is limited in some way (e.g., in computation, in time or in training data), there are 5 types of algorithms that have historically performed best: neural networks, tree-based methods, clustering, anomaly detection and reinforcement learning (RL).
These are not necessarily disjoint, as one can, for example, perform anomaly detection via neural networks. Nonetheless, to keep it simple, let’s stick to these 5 classes. Neural networks shine with large amounts of data on visual, auditory or textual problems. For that reason, they are used in Deepfakes and their detection, lie detection and speech recognition. Many other applications exist as well. But one of the most interesting applications of neural networks (and deep learning) is in creating data via Generative adversarial networks (GANs). GANs can be used to generate password guesses and evasive malware. For more details, I’ll refer you to the Machine Learning for Cybersecurity Cookbook. The next class of models that perform well are tree-based. These include Random Forests and gradient boosting trees. These perform well on structured data with many features. For example, the PE header of PE files (including malware) can be featurized, yielding ~70 numerical features. It is convenient and effective to construct an XGBoost model (a gradient-boosting model) or a Random Forest model on this data, and the odds are good that performance will be unbeatable by other algorithms. Next there is clustering. Clustering shines when you would like to segment a population automatically. For example, you might have a large collection of malware samples, and you would like to classify them into families. Clustering is a natural choice for this problem. Anomaly detection lets you fight off unseen and unknown threats. For instance, when a hacker utilizes a new tactic to intrude on your network, an anomaly detection algorithm can protect you even if this new tactic has not been documented. Finally, RL algorithms perform well on dynamic problems. The situation can be, for example, a penetration test on a network. The DeepExploit framework, covered in the book, utilizes an RL agent on top of metasploit to learn from prior pen tests and becomes better and better at finding vulnerabilities. Generative Adversarial Networks (GANs) are a popular branch of ML used to train systems against counterfeit data. How can these help in malware detection and safeguarding systems to identify correct intrusion? A good way to think about GANs is as a pair of neural networks, pitted against each other. The loss of one is the objective of the other. As the two networks are trained, each becomes better and better at its job. We can then take whichever side of the “tug of war” battle, separate it from its rival, and use it. In other cases, we might choose to “freeze” one of the networks, meaning that we do not train it, but only use it for scoring. In the case of malware, the book covers how to use MalGAN, which is a GAN for malware evasion. One network, the detector, is frozen. In this case, it is an implementation of MalConv. The other network, the adversarial network, is being trained to modify malware until the detection score of MalConv drops to zero. As it trains, it becomes better and better at this. In a practical situation, we would want to unfreeze both networks. Then we can take the trained detector, and use it as part of our anti-malware solution. We would then be confident knowing that it is very good at detecting evasive malware. The same ideas can be applied in a range of cybersecurity contexts, such as intrusion and deepfakes. 
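To make the tree-based point above concrete, here is a minimal sketch in R (the language used in the code examples later in this collection). It is not taken from the book: the feature matrix and labels below are synthetic stand-ins for the roughly 70 numerical PE-header features and malicious/benign labels mentioned above, and the hyperparameters are illustrative only.

# Hedged sketch: gradient-boosted malware classifier on synthetic stand-in data
library(xgboost)

set.seed(42)
n_samples  <- 1000
n_features <- 70                                    # ~70 PE-header features (assumed)
features   <- matrix(rnorm(n_samples * n_features), nrow = n_samples)
labels     <- rbinom(n_samples, 1, 0.5)             # 1 = malicious, 0 = benign (synthetic)

dtrain <- xgb.DMatrix(data = features, label = labels)
model  <- xgboost(data = dtrain, objective = "binary:logistic",
                  nrounds = 50, max_depth = 6, verbose = 0)

# Score a new (synthetic) sample: values near 1 suggest malware
new_sample <- matrix(rnorm(n_features), nrow = 1)
predict(model, new_sample)

With real data, the same few lines apply once features are extracted from labelled samples; swapping in the randomForest package gives the Random Forest variant Tsukerman also mentions.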
On how Machine Learning for Cybersecurity Cookbook can help with easy implementation of ML for cybersecurity problems What are some of the tools/recipes mentioned in your book that can help cybersecurity professionals to easily implement machine learning and make it a part of their day-to-day activities? The Machine Learning for Cybersecurity Cookbook offers an astounding 80+ recipes. The most applicable recipes will vary between individual professionals, and even for each individual, different recipes will be applicable at different times in their careers. For a cybersecurity professional beginning to work with malware, the fundamentals chapter, Chapter 2, ML-based Malware Detection, provides a solid and excellent start to creating a malware classifier. For more advanced malware analysts, Chapter 3, Advanced Malware Detection, will offer more sophisticated and specialized techniques, such as dealing with obfuscation and script malware. Every cybersecurity professional would benefit from getting a firm grasp of Chapter 4, "ML for Social Engineering". In fact, anyone at all should have an understanding of how ML can be used to trick unsuspecting users, as part of their cybersecurity education. This chapter really shows that you have to be cautious because machines are becoming better at imitating humans. On the other hand, ML also provides the tools to know when such an attack is being performed. Chapter 5, "Penetration Testing Using ML", is a technical chapter, and is most appropriate for cybersecurity professionals who are concerned with pen testing. It covers 10 ways in which pen testing can be improved by using ML, including neural network-assisted fuzzing and DeepExploit, a framework that utilizes a reinforcement learning (RL) agent on top of Metasploit to perform automatic pen testing. Chapter 6, "Automatic Intrusion Detection", has a wider appeal, as a lot of cybersecurity professionals have to know how to defend a network from intruders. They would benefit from seeing how to leverage ML to stop zero-day attacks on their network. In addition, the chapter covers many other use cases, such as spam filtering, botnet detection and insider threat detection, which are more useful to some than to others. Chapter 7, "Securing and Attacking Data with ML", provides great content for cybersecurity professionals interested in utilizing ML to improve their password security and other forms of data security. Chapter 8, "Secure and Private AI", is invaluable to data scientists in the field of cybersecurity. Recipes in this chapter include federated learning and differential privacy (which allow you to train an ML model on clients' data without compromising their privacy) and testing adversarial robustness (which allows you to improve the robustness of ML models to adversarial attacks). Your book talks about using machine learning to generate custom malware to pentest security. Can you elaborate on how this works and why this matters? As a general rule, you want to find out your vulnerabilities before someone else does (someone who might be up to no good). For that reason, pen testing has always been an important step in providing security. To pen test your antivirus well, it is important to use the latest techniques in malware evasion, as the bad guys will certainly try them, and these are deep learning-based techniques for modifying malware. On Emmanuel's personal achievements in the cybersecurity domain Dr.
Tsukerman, in 2017, your anti-ransomware product was listed in the ‘Top 10 ransomware products of 2018’ by PC Magazine. In your experience, why are ransomware attacks on the rise and what makes an effective anti-ransomware product? Also, in 2018,  you designed an ML-based, instant-verdict malware detection system for Palo Alto Networks' WildFire service of over 30,000 customers. Can you tell us more about this project? If you monitor cybersecurity news, you would see that ransomware continues to be a huge threat. The reason is that ransomware offers cybercriminals an extremely attractive weapon. First, it is very difficult to trace the culprit from the malware or from the crypto wallet address. Second, the payoffs can be massive, be it from hitting the right target (e.g., a HIPAA compliant healthcare organization) or a large number of targets (e.g., all traffic to an e-commerce web page). Thirdly, ransomware is offered as a service, which effectively democratizes it! On the flip side, a lot of the risk of ransomware can be mitigated through common sense tactics. First, backing up one’s data. Second, having an anti-ransomware solution that provides guarantees. A generic antivirus can provide no guarantee - it either catches the ransomware or it doesn’t. If it doesn’t, your data is toast. However, certain anti-ransomware solutions, such as the one I have developed, do offer guarantees (e.g., no more than 0.1% of your files lost). Finally, since millions of new ransomware samples are developed each year, the malware solution must include a machine learning component, to catch the zero-day samples, which is another component of the anti-ransomware solution I developed. The project at Palo Alto Networks is a similar implementation of ML for malware detection. The one difference is that unlike the anti-ransomware service, which is an endpoint security tool, it offers protection services from the cloud. Since Palo Alto Networks is a firewall-service provider, that makes a lot of sense, since ideally, the malicious sample will be stopped at the firewall, and never even reach the endpoint. To learn how to implement the techniques discussed in this interview, grab your copy of the Machine Learning for Cybersecurity Cookbook Don’t wait - the bad guys aren’t waiting. Author Bio Emmanuel Tsukerman graduated from Stanford University and obtained his Ph.D. from UC Berkeley. In 2017, Dr. Tsukerman's anti-ransomware product was listed in the Top 10 ransomware products of 2018 by PC Magazine. In 2018, he designed an ML-based, instant-verdict malware detection system for Palo Alto Networks' WildFire service of over 30,000 customers. In 2019, Dr. Tsukerman launched the first cybersecurity data science course. About the book Machine Learning for Cybersecurity Cookbook will guide you through constructing classifiers and features for malware, which you'll train and test on real samples. You will also learn to build self-learning, reliant systems to handle cybersecurity tasks such as identifying malicious URLs, spam email detection, intrusion detection, network protection, and tracking user and process behavior, and much more! DevSecOps and the shift left in security: how Semmle is supporting software developers [Podcast] Elastic marks its entry in security analytics market with Elastic SIEM and Endgame acquisition Businesses are confident in their cybersecurity efforts, but weaknesses prevail

Terrifyingly realistic Deepfake video of Bill Hader transforming into Tom Cruise is going viral on YouTube

Sugandha Lahoti
14 Aug 2019
4 min read
Deepfakes are becoming scarily and indistinguishably real. A YouTube clip of Bill Hader in conversation with David Letterman on his late-night show in 2008 is going viral, in which Hader's face subtly shifts to Cruise's as Hader does his impression. This viral deepfake clip has been viewed over 3 million times and was uploaded by Ctrl Shift Face (a Slovakian citizen who goes by the name of Tom), who has created other entertaining videos using deepfake technology. For the unaware, a deepfake uses artificial intelligence and deep neural networks to alter audio or video to pass it off as true or original content. https://www.youtube.com/watch?v=VWrhRBb-1Ig Deepfakes are problematic as they make it hard to differentiate between fake and real videos or images. This gives people the liberty to use deepfakes for promoting harassment and illegal activities. The most common uses of deepfakes are found in revenge porn, political abuse, and fake celebrity videos such as this one. The top comments on the video clip express the dangers of realistic AI manipulation. "The fade between faces is absolutely unnoticeable and it's flipping creepy. Nice job!" "I'm always amazed with new technology, but this is scary." "Ok, so video evidence in a court of law just lost all credibility" https://twitter.com/TheMuleFactor/status/1160925752004624387 Deepfakes can also be used as a weapon of misinformation, since they can be used to maliciously hoax governments and populations and cause internal conflict. Gavin Sheridan, CEO of Vizlegal, also tweeted the clip: "Imagine when this is all properly weaponized on top of already fractured and extreme online ecosystems and people stop believing their eyes and ears." He also talked about the future impact. "True videos will be called fake videos, fake videos will be called true videos. People steered towards calling news outlets "fake", will stop believing their own eyes. People who want to believe their own version of reality will have all the videos they need to support it," he tweeted. He also wondered whether we would require A-list movie actors at all in the future, and whether we could choose which actor will portray which role. His tweet reads, "Will we need A-list actors in the future when we could just superimpose their faces onto the faces of other actors? Would we know the difference? And could we not choose at the start of a movie which actors we want to play which roles?" The past year has seen accelerated growth in the use of deepfakes. In June, a fake video of Mark Zuckerberg was posted on Instagram under the username bill_posters_uk. In the video, Zuckerberg appears to give a threatening speech about the power of Facebook. Facebook had received strong criticism for promoting fake videos on its platform when, in May, the company refused to remove a doctored video of senior politician Nancy Pelosi. Samsung researchers also released a deepfake that could animate faces with just your voice and a picture using temporal GANs. Following this, the House Intelligence Committee held a hearing to examine the public risks posed by "deepfake" videos. Tom, the creator of the viral video, told The Guardian that he doesn't see deepfake videos as the end of the world and hopes his deepfakes will raise public awareness of the technology's potential for misuse. "It's an arms race; someone is creating deepfakes, someone else is working on other technologies that can detect deepfakes. I don't really see it as the end of the world like most people do. People need to learn to be more critical.
The general public are aware that photos could be Photoshopped, but they have no idea that this could be done with video.” Ctrl Shift Face is also on Patreon offering access to bonus materials, behind the scenes footage, deleted scenes, early access to videos for those who provide him monetary support. Now there is a Deepfake that can animate your face with just your voice and a picture. Mark Zuckerberg just became the target of the world’s first high profile white hat deepfake op. Worried about Deepfakes? Check out the new algorithm that manipulate talking-head videos by altering the transcripts.

New QGIS 3D capabilities and future plans presented by Martin Dobias, a core QGIS developer

Bhagyashree R
13 Dec 2019
8 min read
In his talk titled QGIS 3D: current state and future at FOSS4G 2019, Martin Dobias, CTO of Lutra Consulting, talked about the new features in QGIS 3D. He also shared a list of features that could be added to QGIS 3D to make 3D rendering in QGIS more powerful. Free and Open Source Software for Geospatial (FOSS4G) 2019 was a five-day event held from Aug 26-30 in Bucharest. FOSS4G is a conference where geospatial professionals, students, and professors come together to discuss free and open-source software for geospatial storage, processing, and visualization. Further Learning This article explores the new features in QGIS 3D native rendering support. If you are embarking on your QGIS journey, check out our book Learn QGIS - Fourth Edition by Andrew Cutts and Anita Graser. In this book, you will explore the QGIS user interface, load your data, edit it, and then create data. QGIS often surprises new users with its mapping capabilities; you will discover how easily you can style and create your first map. But that's not all! In the final part of the book, you'll learn about spatial analysis, powerful tools in QGIS, and conclude by looking at Python processing options. 3D visualization in QGIS QGIS 3D native rendering support was introduced in QGIS 3. Prior to that, developers had to rely on third-party tools like NVIZ from GRASS GIS, GVIZ, the Globe plugin, the Qgis2threejs plugin, and more. Though these worked, "the integration was never great with the rest of QGIS," remarks Dobias. In 2017, a QGIS grant proposal was accepted to start the initial work on QGIS 3D. A year later, QGIS 3 was announced with an interactive, fully integrated interface for you to work in 3D. QGIS 3 has a separate interface dedicated to 3D data visualization called the 3D map view, which you can access from the View menu. After you select this option, a new window opens that you can dock to the main panel. In the new window you will see all the layers that are visible in the main map view, with digital elevation and vector data rendered in 3D. With native QGIS 3D support you can render raster, vector, and mesh layers. It also provides various methods for visualizing and styling the 3D data depending on the data or geometry type. Here are some of the features that Dobias talked about: Point-based rendering Starting with QGIS 3, you have three ways to render points: Basic symbols: You can use symbols such as spheres, cylinders, boxes, or cubes, apply a color, and apply a few transformations. 3D models loaded from a file: You can use the Open Asset Import Library (Assimp) to load the 3D models. This library allows you to import and export a wide range of 3D model file formats, including Collada, Wavefront, and more. After loading the model you can make tweaks like changing the color. However, there are currently limitations: "you can only change the color of the whole model and not the individual components," Dobias mentioned. Billboard rendering: This feature was contributed by Ismail Sunni as a part of the Google Summer of Code (GSoC) 2019 project, QGIS 3D Improvement. The billboard support, which was released in QGIS 3.10, allows you to render points as billboards in the 3D map view. Line rendering For line rendering, you have two options: Simple lines: In this approach, you define the width of a line in pixels and it does not change when you zoom in or out. This technique preserves Z coordinates.
Buffered lines: In this approach, you define the line width in map units. So, as soon as you start zooming in the line will appear zoomed out. Buffered rendering ignores z-coordinates. Polygon rendering For polygon rendering, you have four different options: Planar 3D entity: QGIS 3 provides a method to draw polygon geometries as planar polygons. Extrusion: Extrusion is a way to create 3D symbology from 2D features by stretching it vertically. QGIS now supports extruding a planar polygon to make it look like a box. You can specify a constant height or you can write an expression that determines it. Polyhedral surfaces or PolygonZ: QGIS 3 has a provision for creating polyhedral surfaces. Polyhedron is simply a three-dimensional solid which consists of a collection of polygons, usually joined at their edges. Triangular mesh or MultiPatch: It is similar to polyhedral surfaces, the only difference is that it consists of individual triangles. 3D map tools Navigation: You can use mouse and keyboard to navigate the map. Now, with the latest QGIS release you can also perform navigation using on-screen controls. Dobias said, “This is good for beginners when they are not completely sure about other means of moving the map.” Identify tool: With this tool, you can interact with the map canvas and get information on features in a pop-up window. It works exactly like its 2D counterpart, the only difference being it will be on a 3D entity. Measurement tool: This tool was also built as part of the GSoC project. This will enable you to measure real distances between given points. Other 3D capabilities Print layout support QGIS already had support to save the 3D map view as an image file, but for print layouts you needed to perform multiple steps. You had to first save 3D scene images and then embed them within print layouts. Also, the resolution of the saved images was limited to the size of the 3D window. To simplify the use of 3D scenes for printing and allow high resolution scene exports, QGIS 3 supports a new type of layout item that is capable of high resolution exports of 3D map scenes. Camera animation support With the QGIS 3D support, now users can define keyframes on a timeline with camera positions and view directions for various points in time. The 3D engine will interpolate camera parameters between keyframes to create animations. These resulting animations can then be played within the 3D view or exported frame-by-frame to a series of images. Configuration of lights By default, the 3D view has a single white light placed above the centre of the 3D scene. Now, users can set up light source position, color, and intensity and even define multiple lights for some interesting effects. Rule-based 3D rendering Previously, it was only possible to define one 3D renderer per layer meaning all features appear the same. QGIS 3 features rule-based rendering for 3D to make it much easier to apply more complex styling in 3D without having to duplicate vector layers and apply filters. There are many other 3D capabilities that you can explore including terrain shading, better camera control, and more. Where you can find data for 3D maps Dobias shared a few great 3D city models that are free to use including CityGML and CityJSON. To easily load CityJSON datasets in QGIS you can use the CityJSON Loader plugin. OpenStreetMap (OSM) is another project that provides buildings data. You can also use the Google dataset search. Just type CityGML in a search box and find the data you need. 
QGIS 3D capabilities to expect in the future Dobias further talked about the future plans for QGIS 3D. Currently, the team is working on improving support for larger 3D scenes and also make them load faster. For the far future, Dobias shared a wishlist of features that can be implemented in QGIS to make its 3D support much more powerful: Enhancing the 3D rendering performance More rendering techniques like shadows, transparency New materials to show textured objects More styles for vector layers such as lines and 3D pipes More data types such as point cloud and 3D rasters Formats support like 3D tiles, Arc SceneLayer Animation of data in scenes Profile tool Blender export Rendering of point cloud You just read about some of the latest features in QGIS 3 for 3D rendering. If you are new to QGIS and want to grasp its fundamentals, check out our book Learn QGIS - Fourth Edition by Anita Graser and Andrew Cutts. In this book, you will explore various ways to load data into QGIS, understand how to style data and present it in a map, and create maps and explore ways to expand them. You will get acquainted with the new processing toolbox in QGIS 3.4, manipulate your geospatial data and gain quality insights, and work with QGIS 3.4 in 3D. Why geospatial analysis and GIS matters more than ever today Top 7 libraries for geospatial analysis Uber’s kepler.gl, an open source toolbox for GeoSpatial Analysis

Context – Understanding your Data using R

Packt
07 Feb 2017
32 min read
In this article by James D Miller, the author of the book Big Data Visualization, we will explore the idea of adding context to the data you are working with. Specifically, we'll discuss the importance of establishing data context and the practice of profiling your data for context discovery, as well as how big data affects this effort. The article is organized into the following main sections: Adding Context About R R and Big Data R Example 1 -- healthcare data R Example 2 -- healthcare data When writing a book, authors leave context clues for their readers. A context clue is a "source of information" about written content that may be difficult or unique, and it helps readers understand. This information offers insight into the content being read or consumed (an example might be: "It was an idyllic day; sunny, warm and perfect…"). With data, context clues should be developed through a process referred to as profiling (we'll discuss profiling in more detail later in this article), so that the data consumer can better understand the data when it is visualized. (Additionally, having context and perspective on the data you are working with is a vital step in determining what kind of data visualization should be created.) Context or profiling examples might be calculating the average age of "patients" or subjects within the data, or "segmenting the data into time periods" (years or months, usually). Another motive for adding context to data might be to gain a new perspective on the data. An example of this might be recognizing and examining a comparison present in the data. For example, body fat percentages of urban high school seniors could be compared to those of rural high school seniors. Adding context to your data before creating visualizations can certainly make the data visualization more relevant, but context still can't serve as a substitute for value. Before you consider any factors such as time of day, geographic location, or average age, first and foremost, your data visualization needs to benefit those who are going to consume it, so establishing appropriate context requirements will be critical. For data profiling (adding context), the rule is: Before Context, Think → Value Generally speaking, there are several visualization contextual categories which can be used to augment or increase the value and understanding of data for visualization. These include: Definitions and explanations, Comparisons, Contrasts, Tendencies, and Dispersion. Definitions and explanations This is providing additional information or "attributes" about a data point. For example, if the data contains a field named "patient ID" and we come to know that records describe individual patients, we may choose to calculate and add each individual patient's BMI, or body mass index: Comparisons This is adding a comparable value to a particular data point. For example, you might compute and add a national ranking to each "total by state": Contrasts This is almost like adding an "opposite" to a data point to see if it perhaps suggests a different perspective. An example might be reviewing average body weights for patients who consume alcoholic beverages versus those who do not consume alcoholic beverages: Tendencies These are the "typical" mathematical calculations (or summaries) on the data as a whole or by another category within the data, such as Mean, Median, and Mode.
For example, you might add a median heart rate for the age group each patient in the data is a member of: Dispersion Again, these are mathematical calculations (or summaries), such as Range, Variance, and Standard Deviation, but they describe the spread or variability of a data set (or group within the data) rather than its center. For example, you may want to add the "range" for a selected value, such as the minimum and maximum number of hospital stays found in the data for each patient age group: The "art" of profiling data to add context and identify new and interesting perspectives for visualization is ever evolving; no doubt there are additional contextual categories existing today that can be investigated as you continue your work with big data visualization projects. Adding Context So, how do we add context to data? Is it merely a matter of selecting Insert, then Data Context? No, it's not that easy (but it's not impossible either). Once you have identified (or "pulled together") your big data source (or at least a significant amount of data), how do you go from mountains of raw big data to summarizations that can be used as input to create valuable data visualizations, helping you to further analyze that data and support your conclusions? The answer is through data profiling. Data profiling involves logically "getting to know" the data you think you may want to visualize, through query, experimentation and review. Following the profiling process, you can then use the information you have collected to add context (and/or apply new "perspectives") to the data. Adding context to data requires manipulating that data: perhaps reformatting it, adding calculations, aggregations or additional columns, re-ordering it, and so on. Finally, you will be ready to visualize (or "picture") your data. The complete profiling process is: Pull together (the data, or enough of the data), Profile (the data, through query, experimentation and review), add Perspective(s) (or context) and finally… Picture (visualize) the data. About R R is a language and environment that is easy to learn, very flexible in nature and also very focused on statistical computing, making it great for manipulating, cleaning, summarizing, and producing probability statistics (as well as actually creating visualizations with your data), so it's a great choice for the exercises required for profiling, establishing context and identifying additional perspectives. In addition, here are a few more reasons to use R when profiling your big data: R is used by a large number of academic statisticians, so it's a tool that is not "going away". R is pretty much platform independent; what you develop will run almost anywhere. R has awesome help resources; just Google it, you'll see! R and Big Data Although R is free (open sourced), super flexible, and feature rich, you must keep in mind that R holds everything in your machine's memory, and this can become problematic when you are working with big data (even with today's low resource costs). Thankfully, though, there are various options and strategies for working with this limitation, such as employing a sort of "pseudo-sampling" technique, which we will expound on later in this article (as part of some of the examples provided). Additionally, R libraries have been developed and introduced that can leverage hard drive space (as a sort of virtual extension to your machine's memory), again demonstrated in this article's examples.
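To make the in-memory limitation concrete, below is a minimal sketch (not from the original article) of one such strategy: reading a large comma-delimited file in fixed-size chunks so that the whole file never has to sit in memory at once. The file path, chunk size, and the per-chunk work (a simple row count) are illustrative assumptions only.

# --- hedged sketch: process a large delimited file in chunks
chunkSize <- 100000
con <- file("C:/Big Data Visualization/Chapter 3/sampleHCSurvey02.txt", open = "r")
totalRows <- 0
repeat {
  chunk <- tryCatch(
    read.table(con, sep = ",", nrows = chunkSize, header = FALSE,
               colClasses = "character", stringsAsFactors = FALSE),
    error = function(e) NULL)            # read.table errors once no lines remain
  if (is.null(chunk) || nrow(chunk) == 0) break
  totalRows <- totalRows + nrow(chunk)   # replace with real per-chunk profiling
  if (nrow(chunk) < chunkSize) break     # last (partial) chunk reached
}
close(con)
totalRows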
Example 1
In this article's first example, we'll use data collected from a theoretical hospital where, upon admission, patient medical history information is collected through an online survey. Information is also added to a "patients file" as treatment is provided. The file includes many fields, including:

Basic descriptive data for the patient, such as: sex, date of birth, height, weight, blood type, etc.
Vital statistics, such as: blood pressure, heart rate, etc.
Medical history, such as: number of hospital visits, surgeries, major illnesses or conditions, currently under a doctor's care, etc.
Demographic statistics, such as: occupation, home state, educational background, etc.

Some additional information is also collected in the file in an attempt to capture patient characteristics and habits, such as the number of times the patient included beef, pork, and fowl in their weekly diet, or whether they typically use a butter replacement product, and so on.

Periodically, the data is "dumped" to text files that are comma-delimited and contain the following fields (in this order):

Patientid, recorddate_month, recorddate_day, recorddate_year, sex, age, weight, height, no_hospital_visits, heartrate, state, relationship, Insured, Bloodtype, blood_pressure, Education, DOBMonth, DOBDay, DOBYear, current_smoker, current_drinker, currently_on_medications, known_allergies, currently_under_doctors_care, ever_operated_on, occupation, Heart_attack, Rheumatic_Fever, Heart_murmur, Diseases_of_the_arteries, Varicose_veins, Arthritis, abnormal_bloodsugar, Phlebitis, Dizziness_fainting, Epilepsy_seizures, Stroke, Diphtheria, Scarlet_Fever, Infectious_mononucleosis, Nervous_emotional_problems, Anemia, hyroid_problems, Pneumonia, Bronchitis, Asthma, Abnormal_chest_Xray, lung_disease, Injuries_back_arms_legs_joints_Broken_bones, Jaundice_gallbladder_problems, Father_alive, Father_current_age, Fathers_general_health, Fathers_reason_poor_health, Fathersdeceased_age_death, mother_alive, Mother_current_age, Mother_general_health, Mothers_reason_poor_health, Mothers_deceased_age_death, No_of_brothers, No_of_sisters, age_range, siblings_health_problems, Heart_attacks_under_50, Strokes_under_50, High_blood_pressure, Elevated_cholesterol, Diabetes, Asthma_hayfever, Congenital_heart_disease, Heart_operations, Glaucoma, ever_smoked_cigs, cigars_or_pipes, no_cigs_day, no_cigars_day, no_pipefuls_day, if_stopped_smoking_when_was_it, if_still_smoke_how_long_ago_start, target_weight, most_ever_weighed, 1_year_ago_weight, age_21_weight, No_of_meals_eatten_per_day, No_of_times_per_week_eat_beef, No_of_times_per_week_eat_pork, No_of_times_per_week_eat_fish, No_of_times_per_week_eat_fowl, No_of_times_per_week_eat_desserts, No_of_times_per_week_eat_fried_foods, No_servings_per_week_wholemilk, No_servings_per_week_2%_milk, No_servings_per_week_tea, No_servings_per_week_buttermilk, No_servings_per_week_1%_milk, No_servings_per_week_regular_or_diet_soda, No_servings_per_week_skim_milk, No_servings_per_week_coffee, No_servings_per_week_water, beer_intake, wine_intake, liquor_intake, use_butter, use_extra_sugar, use_extra_salt, different_diet_weekends, activity_level, sexually_active, vision_problems, wear_glasses

Following is the image showing a portion of the file (displayed in MS Windows Notepad):

Assuming we have been given no further information about the data, other than the provided field name list and the knowledge that the data is captured by hospital personnel upon patient admission, the next step would be to perform some sort of profiling of the data.
Profiling means investigating the data to start understanding it, and then adding context and perspectives (so that ultimately we can create some visualizations).

Initially, we start out by looking through the field or column names in our file, and some ideas start to come to mind. For example:

What is the data time frame we are dealing with? Using the field record date, can we establish a period of time (or time frame) for the data? (In other words, over what period of time was this data captured?)
Can we start "grouping the data" using fields such as sex, age, and state?

Eventually, what we should be asking is, "what can we learn from visualizing the data?" Perhaps:

What is the breakdown of those currently smoking by age group?
What is the ratio of those currently smoking to the number of hospital visits?
Do those patients currently under a doctor's care, on average, have better BMI ratios?

And so on.

Dig in with R
Using the power of R programming, we can run various queries on the data, noting that the results of those queries may spawn additional questions and queries and, eventually, yield data ready for visualizing. Let's start with a few simple profile queries. I always start my data profiling by "time boxing" the data. The following R script works well for this (although, as mentioned earlier, there are many ways to accomplish the same objective):

# --- read our file into a temporary R table
tmpRTable4TimeBox<-read.table(file="C:/Big Data Visualization/Chapter 3/sampleHCSurvey02.txt", sep=",")
# --- convert to an R data frame and filter it to just include
# --- the 2nd column or field of data
data.df <- data.frame(tmpRTable4TimeBox)
data.df <- data.df[,2]
# --- provides a sorted list of the years in the file
YearsInData = substr(substr(data.df[],(regexpr('/',data.df[])+1),11),( regexpr('/',substr(data.df[],(regexpr('/',data.df[])+1),11))+1),11)
# --- write a new file named ListofYears
write.csv(sort(unique(YearsInData)),file="C:/Big Data Visualization/Chapter 3/ListofYears.txt",quote = FALSE, row.names = FALSE)

The above simple R script produces a sorted list file (ListofYears.txt) containing the years found in the data we are profiling.

Now we can see that our patient survey data was collected during the years 1999 through 2016, and with this information we start to add context to (or gain a perspective on) our data. We could further time-box the data, perhaps by breaking the years into months (we will do this later on in this article), but let's move on now to some basic "grouping profiling".

Assuming that each record in our data represents a unique hospital visit, how can we determine the number of hospital visits (the number of records) by sex, age, and state? Here I will point out that it may be worthwhile establishing the size (the number of rows or records – we already know the number of columns or fields) of the file you are working with. This is important since the size of the data file will dictate the programming or scripting approach you will need to use during your profiling. Simple R functions that are valuable to know are nrow and head. These simple commands can be used to count the total rows in a file:

nrow(mydata)

Or to view the first few rows of data:

head(mydata, n = 10)

So, using R, one could write a script to load the data into a table, convert it to a data frame, and then read through all the records in the file and "count up" or "tally" the number of hospital visits (the number of records) for males and females.
Such logic is a snap to write:

# --- assuming tmpRTable holds the data already
datas.df<-data.frame(tmpRTable)
# --- initialize 2 counter variables
NumberMaleVisits <-0;NumberFemaleVisits <-0
# --- read through the data
for(i in 1:nrow(datas.df))
{
if (datas.df[i,3] == 'Male') {NumberMaleVisits <- NumberMaleVisits + 1}
if (datas.df[i,3] == 'Female') {NumberFemaleVisits <- NumberFemaleVisits + 1}
}
# --- show me the totals
NumberMaleVisits
NumberFemaleVisits

The previous script works but, in a big data scenario, there is a more efficient way, since reading (or "looping") through and counting each record will take far too long. Thankfully, R provides the table function, which can be used similarly to the SQL "group by" command. The following script assumes that our data is already in an R data frame (named datas.df), so, using the sequence number of the field in the file, if we want to see the number of hospital visits for males and the number of hospital visits for females, we can write:

# --- using R table function as "group by" field number
# --- patient sex is the 3rd field in the file
table(datas.df[,3])

Following is the output generated by running the above script. Notice that R shows "sex" with a count of 1, since the script included the file's header record as a unique value:

We can also establish the number of hospital visits by state (state is the 9th field in the file):

table(datas.df[,9])

Age (the fourth field in the file) can also be studied using the R functions sort and table:

sort(table(datas.df[,4]))

Note that, since there are quite a few more values for age within the file, I've sorted the output using the R sort function.

Moving on now, let's see if there is a difference between the number of hospital visits for patients who are current smokers (the field is named current_smoker and is field number 16 in the file) and those indicating that they are not current smokers. We can use the same R scripting logic:

sort(table(datas.df[,16]))

Surprisingly (one might think), it appears from our profiling that those patients who currently do not smoke have had more hospital visits (113,681) than those who currently are smokers (12,561):

Another interesting R script to continue profiling our data might be:

table(datas.df[,3],datas.df[,16])

The above script again uses the R table function to group data, but shows how we can "group within a group"; in other words, using this script we can get totals for "current" and "non-current" smokers, grouped by sex. In the below image we see that the difference between female smokers and male smokers might be considered marginal:

So, we see that by using the above simple R script examples, we've been able to add some context to our healthcare survey data. By reviewing the list of fields provided in the file, we can come up with the R profiling queries shown (and many others) without much effort. We will continue with some more complex profiling in the next section but, for now, let's use R to create a few data visualizations based upon what we've learned so far through our profiling.

Going back to the number of hospital visits by sex, we can use the R function barplot to create a visualization of visits by sex. But first, a couple of "helpful hints" for creating the script. First, rather than using the table function, you can use the ftable function, which creates a "flat" version of the original function's output. This makes it easier to exclude the header record count of 1 that comes back from the table function.
Next, we can leverage some additional arguments of the barplot function, like col, border, and names.arg (plus a call to the title function), to make the visualization a little "nicer to look at". Below is the script:

# --- use ftable function to drop out the header record
forChart<- ftable(datas.df[,3])
# --- create bar names
barnames<-c("Female","Male")
# --- use barplot to draw bar visual
barplot(forChart[2:3], col = "brown1", border = TRUE, names.arg = barnames)
# --- add a title
title(main = list("Hospital Visits by Sex", font = 4))

The script's output (our visualization) is shown below:

We could follow the same logic for creating a similar visualization of hospital visits by state:

st<-ftable(datas.df[,9])
barplot(st)
title(main = list("Hospital Visits by State", font = 2))

But the visualization generated isn't very clear:

One can always experiment a bit more with this data to make the visualization a little more interesting. Using the R functions substr and regexpr, we can create an R data frame that contains a record for each hospital visit by state within each year in the file. Then we can use the function plot (rather than barplot) to generate the visualization. Below is the R script:

# --- create a data frame from our original table file
datas.df <- data.frame(tmpRTable)
# --- create a filtered data frame of records from the file
# --- using the record year and state fields from the file
dats.df<-data.frame(substr(substr(datas.df[,2],(regexpr('/',datas.df[,2])+1),11),( regexpr('/',substr(datas.df[,2],(regexpr('/',datas.df[,2])+1),11))+1),11),datas.df[,9])
# --- plot to show a visualization
plot(sort(table(dats.df[2]),decreasing = TRUE),type="o", col="blue")
title(main = list("Hospital Visits by State (Highest to Lowest)", font = 2))

Here is the different (and perhaps more interesting) version of the visualization generated by the previous script:

Another earlier perspective on the data concerned age. We grouped the hospital visits by the age of the patients (using the R table function). Since there are many different patient ages, a common practice is to establish age ranges, such as the following:

21 and under
22 to 34
35 to 44
45 to 54
55 to 64
65 and over

To implement the previous age ranges, we need to organize the data, and could use the following R script:

# --- initialize age range counters
a1 <-0;a2 <-0;a3 <-0;a4 <-0;a5 <-0;a6 <-0
# --- read and count visits by age range
for(i in 2:nrow(datas.df))
{
if (as.numeric(datas.df[i,4]) < 22) {a1 <- a1 + 1}
if (as.numeric(datas.df[i,4]) > 21 & as.numeric(datas.df[i,4]) < 35) {a2 <- a2 + 1}
if (as.numeric(datas.df[i,4]) > 34 & as.numeric(datas.df[i,4]) < 45) {a3 <- a3 + 1}
if (as.numeric(datas.df[i,4]) > 44 & as.numeric(datas.df[i,4]) < 55) {a4 <- a4 + 1}
if (as.numeric(datas.df[i,4]) > 54 & as.numeric(datas.df[i,4]) < 65) {a5 <- a5 + 1}
if (as.numeric(datas.df[i,4]) > 64) {a6 <- a6 + 1}
}

Big Data Note: Looping or reading through each of the records in our file isn't very practical if there are a trillion records. Later in this article we'll use a much better approach but, for now, we will assume a smaller file size for convenience.
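As a hedged aside (not part of the original text), the same age-range tallies can be produced without a loop by letting R's cut and table functions do the binning; the age field position (column 4) and the header-record skip follow the earlier examples:

# --- vectorized alternative to the age-range counting loop above
ages <- as.numeric(datas.df[2:nrow(datas.df), 4])   # skip the header record
ageRanges <- cut(ages,
                 breaks = c(-Inf, 21, 34, 44, 54, 64, Inf),
                 labels = c("under 21", "22-34", "35-44", "45-54", "55-64", "65 & over"))
table(ageRanges)

The counters a1 through a6 produced by the loop above are still what feed the pie chart that follows.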
Once the above script is run, we can use the R pie function and the following code to create our pie chart visualization:

# --- create Pie Chart
slices <- c(a1, a2, a3, a4, a5, a6)
lbls <- c("under 21", "22-34","35-44","45-54","55-64", "65 & over")
pie(slices, labels = lbls, main="Hospital Visits by Age Range")

Following is the generated visualization:

Finally, earlier in this section we looked at the values in field 16 of our file, which indicates whether the survey patient is a current smoker. We could build a simple visual showing the totals, but (again) the visualization isn't very interesting or all that informative. With some simple R scripts, we can proceed to create a visualization showing the number of hospital visits, year over year, by those patients that are current smokers.

First, we can "reformat" the data in our R data frame (named datas.df) to store only the year (of the record date), using the R function substr. This makes it a little easier to aggregate the data by year, as shown in the next steps. The R script using the substr function is shown below:

# --- redefine the record date field to hold just the record
# --- year value
datas.df[,2]<-substr(substr(datas.df[,2],(regexpr('/',datas.df[,2])+1),11),( regexpr('/',substr(datas.df[,2],(regexpr('/',datas.df[,2])+1),11))+1),11)

Next, we can create an R table named c to hold the record date year and totals (of non-smokers and current smokers) for each year. Following is the R script used:

# --- create a table holding record year and total count for
# --- smokers and not smoking
c<-table(datas.df[,2],datas.df[,16])

Finally, we can use the R barplot function to create our visualization. Again, there is more than likely a cleverer way to set up the objects bars and lbls but, for now, I simply hand-coded the years' data I wanted to see in my visualization:

# --- set up the values to chart and the labels for each bar
# --- in the chart
bars<-c(c[2,3], c[3,3], c[4,3],c[5,3],c[6,3],c[7,3],c[8,3],c[9,3],c[10,3],c[11,3],c[12,3],c[13,3])
lbls<-c("99","00","01","02","03","04","05","06","07","08","09","10")

Now the R script to actually produce the bar chart visualization is shown below:

# --- create the bar chart
barplot(bars, names.arg=lbls, col="red")
title(main = list("Smoking Patients Year to Year", font = 2))

Below is the generated visualization:

Example 2
In the above examples, we've presented some pretty basic and straightforward data profiling exercises. Typically, once you've become somewhat familiar with your data – having added some context (through some basic profiling) – you would extend the profiling process, trying to look at the data in additional ways, using techniques such as those mentioned at the beginning of this article: defining new data points based upon the existing data, performing comparisons, looking at contrasts (between data points), identifying tendencies, and using dispersions to establish the variability of the data. Let's now review some of these options for extended profiling, using simple examples as well as the same source data as was used in the previous section's examples.

Definitions & Explanations
One method of extending your data profiling is to "add to" the existing data by creating additional definition or explanatory "attributes" (in other words, adding new fields to the file). This means that you use existing data points found in the data to create (hopefully new and interesting) perspectives on the data.
In the data used in this article, a thought-provoking example might be to use the existing patient information (such as the patient's weight and height) to calculate a new point of data: body mass index (BMI) information. A generally accepted formula for calculating a patient's body mass index is:

BMI = (Weight (lbs.) / (Height (in.))²) x 703

For example: (165 lbs. / (70 in.)²) x 703 = 23.67 BMI.

Using the above formula, we can use the following R script (assuming we've already loaded the R object named tmpRTable with our file data) to generate a new file of BMI percentages and state names:

j=1
for(i in 2:nrow(tmpRTable))
{
W<-as.numeric(as.character(tmpRTable[i,5]))
H<-as.numeric(as.character(tmpRTable[i,6]))
P<-(W/(H^2)*703)
datas2.df[j,1]<-format(P,digits=3)
datas2.df[j,2]<-tmpRTable[i,9]
j=j+1
}
write.csv(datas2.df[1:j-1,1:2],file="C:/Big Data Visualization/Chapter 3/BMI.txt", quote = FALSE, row.names = FALSE)

Below is a portion of the generated file:

Now we have a new file of BMI percentages by state (one BMI record for each hospital visit in each state). Earlier in this article we touched on the concept of looping (or reading) through all of the records in a file or data source and creating counts based on various field or column values. Such logic works fine for medium or smaller files, but a much better approach (especially with big data files) is to use the power of various R commands.

No Looping
Although the above-described R script does work, it requires looping through each record in our file, which is slow and inefficient to say the least. So, let's consider a better approach. Again, assuming we've already loaded the R object named tmpRTable with our data, the below R script can accomplish the same results (create the same file) in just 2 lines:

PDQ<-paste(format((as.numeric(as.character(tmpRTable[,5]))/(as.numeric(as.character(tmpRTable[,6]))^2)*703),digits=2),',',tmpRTable[,9],sep="")
write.csv(PDQ,file="C:/Big Data Visualization/Chapter 3/BMI.txt", quote = FALSE,row.names = FALSE)

We could now use this file (or one similar) as input to additional profiling exercises or to create a visualization, but let's move on.

Comparisons
Performing comparisons during data profiling can also add new and different perspectives to the data. Beyond simple record counts (like the total number of smoking patients visiting a hospital versus the total number of non-smoking patients visiting a hospital), one might want to compare the total number of hospital visits for each state to the average number of hospital visits per state. This would require calculating the total number of hospital visits by state, as well as the total number of hospital visits overall (then computing the average).
The following 2 lines of code use the R functions table and write.csv to create a list (a file) of the total number of hospital visits found for each state:

# --- calculates the number of hospital visits for each
# --- state (state ID is in field 9 of the file)
StateVisitCount<-table(datas.df[9])
# --- write out a csv file of counts by state
write.csv (StateVisitCount, file="C:/Big Data Visualization/Chapter 3/visitsByStateName.txt", quote = FALSE, row.names = FALSE)

Below is a portion of the file that is generated:

The following R command can be used to calculate the average number of hospital visits per state, by using the nrow function to obtain a count of records in the data source and then dividing it by the number of states:

# --- calculate the average
averageVisits<-nrow(datas.df)/50

Going a bit further with this line of thinking, you might consider that the nine states the U.S. Census Bureau designates as the Northeast region are Connecticut, Maine, Massachusetts, New Hampshire, New York, New Jersey, Pennsylvania, Rhode Island, and Vermont. What is the total number of hospital visits recorded in our file for the Northeast region? R makes it simple with the subset function:

# --- use subset function and the "OR" operator to only have
# --- northeast region states in our list
NERVisits<-subset(tmpRTable, as.character(V9)=="Connecticut" | as.character(V9)=="Maine" | as.character(V9)=="Massachusetts" | as.character(V9)=="New Hampshire" | as.character(V9)=="New York" | as.character(V9)=="New Jersey" | as.character(V9)=="Pennsylvania" | as.character(V9)=="Rhode Island" | as.character(V9)=="Vermont")

Extending our scripting, we can add some additional queries to calculate the average number of hospital visits for the Northeast region and for the whole country:

AvgNERVisits<-nrow(NERVisits)/9
averageVisits<-nrow(tmpRTable)/50

And let's add a visualization:

# --- the c object is the data for the barplot function to
# --- graph
c<-c(AvgNERVisits, averageVisits)
# --- use R barplot
barplot(c, ylim=c(0,3000), ylab="Average Visits", border="Black", names.arg = c("Northeast","all"))
title("Northeast Region vs Country")

The generated visualization is shown below:

Contrasts
The examination of contrasting data is another form of extending data profiling. For example, using this article's data, one could contrast the average body weight of patients that are under a doctor's care against the average body weight of patients that are not under a doctor's care (after calculating average body weights for each group).
To accomplish this, we can calculate the average weights for patients that fall into each category (those currently under a doctor's care and those not currently under a doctor's care), as well as for all patients, using the following R script:

# --- read in our entire file
tmpRTable<-read.table(file="C:/Big Data Visualization/Chapter 3/sampleHCSurvey02.txt",sep=",")
# --- use the subset function to create the 2 groups we are
# --- interested in
UCare.sub<-subset(tmpRTable, V20=="Yes")
NUCare.sub<-subset(tmpRTable, V20=="No")
# --- use the mean function to get the average body weight of all patients
# --- in the file, as well as for each of our separate groups
average_undercare<-mean(as.numeric(as.character(UCare.sub[,5])))
average_notundercare<-mean(as.numeric(as.character(NUCare.sub[,5])))
averageoverall<-mean(as.numeric(as.character(tmpRTable[2:nrow(tmpRTable),5])))
average_undercare;average_notundercare;averageoverall

In "short order", we can use R's ability to create subsets (using the subset function) of the data based upon the values in a certain field (or column), and then use the mean function to calculate the average patient weight for each group. The results from running the script (the calculated average weights) are shown below:

And if we use the calculated results to create a simple visualization:

# --- use R barplot to create the bar graph of
# --- average patient weight
barplot(c, ylim=c(0,200), ylab="Patient Weight", border="Black", names.arg = c("under care","not under care", "all"), legend.text= c(format(c[1],digits=5),format(c[2],digits=5),format(c[3],digits=5)))
title("Average Patient Weight")

Tendencies
Identifying tendencies present within your data is also an interesting way of extending data profiling. For example, using this article's sample data, you might determine the number of servings of water consumed per week by each patient age group. Earlier in this section we created a simple R script to count visits by age group; it worked, but in a big data scenario that approach may not scale. A better approach would be to categorize the data into the age groups (age is the fourth field or column in the file) using the following script:

# --- build subsets of each age group
agegroup1<-subset(tmpRTable, as.numeric(V4)<22)
agegroup2<-subset(tmpRTable, as.numeric(V4)>21 & as.numeric(V4)<35)
agegroup3<-subset(tmpRTable, as.numeric(V4)>34 & as.numeric(V4)<45)
agegroup4<-subset(tmpRTable, as.numeric(V4)>44 & as.numeric(V4)<55)
agegroup5<-subset(tmpRTable, as.numeric(V4)>54 & as.numeric(V4)<66)
agegroup6<-subset(tmpRTable, as.numeric(V4)>64)

After we have our grouped data, we can calculate water consumption. For example, to count the total weekly servings of water (which is in field or column 96) for age group 1, we can use:

# --- field 96 in the file is the number of servings of water
# --- the below line counts the total number of water servings for
# --- age group 1
sum(as.numeric(agegroup1[,96]))

Or the average number of servings of water for the same age group:

mean(as.numeric(agegroup1[,96]))

Take note that R requires the explicit conversion of the value of field 96 (even though it comes in the file as a number) to a number, using the R function as.numeric. Now, let's create the visualization of this perspective of our data.
Below is the R script used to generate the visualization:

# --- group the data into age groups
agegroup1<-subset(tmpRTable, as.numeric(V4)<22)
agegroup2<-subset(tmpRTable, as.numeric(V4)>21 & as.numeric(V4)<35)
agegroup3<-subset(tmpRTable, as.numeric(V4)>34 & as.numeric(V4)<45)
agegroup4<-subset(tmpRTable, as.numeric(V4)>44 & as.numeric(V4)<55)
agegroup5<-subset(tmpRTable, as.numeric(V4)>54 & as.numeric(V4)<66)
agegroup6<-subset(tmpRTable, as.numeric(V4)>64)
# --- calculate the averages by group
g1<-mean(as.numeric(agegroup1[,96]))
g2<-mean(as.numeric(agegroup2[,96]))
g3<-mean(as.numeric(agegroup3[,96]))
g4<-mean(as.numeric(agegroup4[,96]))
g5<-mean(as.numeric(agegroup5[,96]))
g6<-mean(as.numeric(agegroup6[,96]))
# --- create the visualization
barplot(c(g1,g2,g3,g4,g5,g6), axisnames=TRUE, names.arg = c("<21", "22-34", "35-44", "45-54", "55-64", ">65"))
title("Glasses of Water by Age Group")

The generated visualization is shown below:

Dispersion
Finally, dispersion is still another method of extended data profiling. Dispersion measures how the various elements selected behave with regard to some sort of central tendency, usually the mean. For example, we might look at the total number of hospital visits for each age group, per calendar month, in relation to the average number of hospital visits per month.

For this example, we can use the R function subset in our R scripts (to define our age groups and then group the hospital records by those age groups), as we did in the last example. Below is the script, showing the calculation for each group:

agegroup1<-subset(tmpRTable, as.numeric(V4) <22)
agegroup2<-subset(tmpRTable, as.numeric(V4)>21 & as.numeric(V4)<35)
agegroup3<-subset(tmpRTable, as.numeric(V4)>34 & as.numeric(V4)<45)
agegroup4<-subset(tmpRTable, as.numeric(V4)>44 & as.numeric(V4)<55)
agegroup5<-subset(tmpRTable, as.numeric(V4)>54 & as.numeric(V4)<66)
agegroup6<-subset(tmpRTable, as.numeric(V4)>64)

Remember, the previous scripts create subsets of the entire file (which we loaded into the object tmpRTable), and they contain all of the fields of the entire file. The agegroup1 group is partially displayed as follows:

Once we have our data categorized by age group (agegroup1 through agegroup6), we can then go on and calculate a count of hospital stays by month for each group (shown in the following R commands). Note that the substr function is used to look at the month code (the first 3 characters of the record date) in the file, since we (for now) don't care about the year. The table function can then be used to create an array of counts by month:

az1<-table(substr(agegroup1[,2],1,3))
az2<-table(substr(agegroup2[,2],1,3))
az3<-table(substr(agegroup3[,2],1,3))
az4<-table(substr(agegroup4[,2],1,3))
az5<-table(substr(agegroup5[,2],1,3))
az6<-table(substr(agegroup6[,2],1,3))

Using the above month totals, we can then calculate an average number of hospital visits for each month using the R function mean.
This will be the mean of the monthly totals across ALL age groups; for example, for January:

JanAvg<-mean(c(az1["Jan"], az2["Jan"], az3["Jan"], az4["Jan"], az5["Jan"], az6["Jan"]))

Note that the group totals need to be combined into a single vector with c() before being passed to mean (otherwise only the first value would be averaged), and that the above code can be repeated to calculate an average for each month.

Next we can calculate the totals for each month, for each age group:

Janag1<-az1["Jan"];Febag1<-az1["Feb"];Marag1<-az1["Mar"];Aprag1<-az1["Apr"];Mayag1<-az1["May"];Junag1<-az1["Jun"]
Julag1<-az1["Jul"];Augag1<-az1["Aug"];Sepag1<-az1["Sep"];Octag1<-az1["Oct"];Novag1<-az1["Nov"];Decag1<-az1["Dec"]

The following code "stacks" the totals so we can more easily visualize them later (we would have one line for each age group; that is, Group1Visits, Group2Visits, and so on):

Monthly_Visits<-c(JanAvg, FebAvg, MarAvg, AprAvg, MayAvg, JunAvg, JulAvg, AugAvg, SepAvg, OctAvg, NovAvg, DecAvg)
Group1Visits<-c(Janag1,Febag1,Marag1,Aprag1,Mayag1,Junag1,Julag1,Augag1,Sepag1,Octag1,Novag1,Decag1)
Group2Visits<-c(Janag2,Febag2,Marag2,Aprag2,Mayag2,Junag2,Julag2,Augag2,Sepag2,Octag2,Novag2,Decag2)

Finally, we can now create the visualization:

plot(Monthly_Visits, ylim=c(1000,4000))
lines(Group1Visits, type="b", col="red")
lines(Group2Visits, type="b", col="purple")
lines(Group3Visits, type="b", col="green")
lines(Group4Visits, type="b", col="yellow")
lines(Group5Visits, type="b", col="pink")
lines(Group6Visits, type="b", col="blue")
title("Hospital Visits", sub = "Month to Month", cex.main = 2, font.main= 4, col.main= "blue", cex.sub = 0.75, font.sub = 3, col.sub = "red")

and enjoy the generated output:

Summary
In this article, we went over the idea and importance of establishing context, and perhaps identifying perspectives, for big data, using data profiling with R. Additionally, we introduced and explored the R programming language as an effective means to profile big data, and used R in numerous illustrative examples. Once again, R is an extremely flexible and powerful tool that works well for data profiling, and the reader would be well served by researching and experimenting with the language's vast libraries available today, as we have only scratched the surface of the features currently available.

Resources for Article:
Further resources on this subject:
Introduction to R Programming Language and Statistical Environment [article]
Fast Data Manipulation with R [article]
DevOps Tools and Technologies [article]

“All of my engineering teams have a machine learning feature on their roadmap” - Will Ballard talks artificial intelligence in 2019 [Interview]

Packt Editorial Staff
02 Jan 2019
3 min read
The huge advancements in deep learning and artificial intelligence were perhaps the biggest story in tech in 2018. But we wanted to know what the future might hold - luckily, we were able to speak to Packt author Will Ballard about what he sees in store for artificial intelligence in 2019 and beyond.

Will Ballard is the chief technology officer at GLG, responsible for engineering and IT. He was also responsible for the design and operation of large data centers that helped run site services for customers including Gannett, Hearst Magazines, NFL, NPR, The Washington Post, and Whole Foods. He has held leadership roles in software development at NetSolve (now Cisco), NetSpend, and Works (now Bank of America). Explore Will Ballard's Packt titles here.

Packt: What do you think the biggest development in deep learning / AI was in 2018?

Will Ballard: I think attention models beginning to take the place of recurrent networks is a pretty impressive breakout on the algorithm side.

In Packt's 2018 Skill Up survey, developers across disciplines and job roles identified machine learning as the thing they were most likely to be learning in the coming year. What do you think of that result? Do you think machine learning is becoming a mandatory multidiscipline skill, and why?

Almost all of my engineering teams have an active, or a planned, machine learning feature on their roadmap. We've been able to get all kinds of engineers with different backgrounds to use machine learning -- it really is just another way to make functions -- probabilistic functions -- but functions.

What do you think the most important new deep learning/AI technique to learn in 2019 will be, and why?

In 2019 -- I think it is going to be all about PyTorch and TensorFlow 2.0, and learning how to host these on cloud PaaS.

The benefits of automated machine learning and metalearning

How important do you think automated machine learning and metalearning will be to the practice of developing AI/machine learning in 2019? What benefits do you think they will bring?

Even 'simple' automation techniques like grid search and running multiple different algorithms on the same data are big wins when mastered. There is almost no telling which model is 'right' till you try it, so why not let a cloud of computers iterate through scores of algorithms and models to give you the best available answer?

Artificial intelligence and ethics

Do you think ethical considerations will become more relevant to developing AI/machine learning algorithms going forwards? If yes, how do you think this will be implemented?

I think the ethical issues are important on outcomes, and on how models are used, but aren't the place of algorithms themselves.

If a developer was looking to start working with machine learning/AI, what tools and software would you suggest they learn in 2019?

Python and PyTorch.

Build Hadoop clusters using Google Cloud Platform [Tutorial]

Sunith Shetty
24 Jul 2018
10 min read
Cloud computing has transformed the way individuals and organizations access and manage their servers and applications on the internet. Before Cloud computing, everyone used to manage their servers and applications on their own premises or in dedicated data centers. The increase in the raw computing power (CPU and GPU) of multiple cores on a single chip, and the increase in storage space (HDD and SSD), present challenges in efficiently utilizing the available computing resources.

In today's tutorial, we will learn different ways of building a Hadoop cluster on the Cloud, and ways to store and access data on the Cloud. This article is an excerpt from a book written by Naresh Kumar and Prashant Shindgikar titled Modern Big Data Processing with Hadoop.

Building a Hadoop cluster in the Cloud
The Cloud offers a flexible and easy way to rent resources such as servers, storage, networking, and so on. The Cloud has made it very easy for consumers with the pay-as-you-go model, but much of the complexity of the Cloud is hidden from us by the providers. In order to better understand whether Hadoop is well suited to being on the Cloud, let's try to dig further and see how the Cloud is organized internally.

At the core of the Cloud are the following mechanisms:

A very large number of servers with a variety of hardware configurations
Servers connected and made available over IP networks
Large data centers to host these devices
Data centers spanning geographies with evolved network and data center designs

If we pay close attention, we are talking about the following:

A very large number of different CPU architectures
A large number of storage devices with a variety of speeds and performance
Networks with varying speed and interconnectivity

Let's look at a simple design of such a data center on the Cloud. We have the following devices in the preceding diagram:

S1, S2: Rack switches
U1-U6: Rack servers
R1: Router
Storage area network
Network attached storage

As we can see, Cloud providers have a very large number of such architectures to make them scalable and flexible. You would have rightly guessed that when the number of such servers increases, and when we request a new server, the provider can allocate the server anywhere in the region. This makes it a bit challenging for compute and storage to be together, but also provides elasticity. In order to address this co-location problem, some Cloud providers give the option of creating a virtual network and taking dedicated servers, and then allocating all their virtual nodes on these servers. This is somewhat closer to a data center design, but flexible enough to return resources when not needed.

Let's get back to Hadoop and remind ourselves that, in order to get the best from the Hadoop system, we should have the CPU power closer to the storage. This means that the physical distance between the CPU and the storage should be much less, as the bus speeds match the processing requirements. The slower the I/O speed between the CPU and the storage (for example, iSCSI, storage area network, network attached storage, and so on), the poorer the performance we get from the Hadoop system, as the data is being fetched over the network, kept in memory, and then fed to the CPU for further processing. This is one of the important things to keep in mind when designing Hadoop systems on the Cloud.

Apart from performance reasons, there are other things to consider:

Scaling Hadoop
Managing Hadoop
Securing Hadoop

Now, let's try to understand how we can take care of these in the Cloud environment. Hadoop can be installed by the following methods:

Standalone
Semi-distributed
Fully-distributed

When we want to deploy Hadoop on the Cloud, we can deploy it in the following ways:

Custom shell scripts
Cloud automation tools (Chef, Ansible, and so on)
Apache Ambari
Cloud vendor provided methods (Google Cloud Dataproc, Amazon EMR, Microsoft HDInsight)
Third-party managed Hadoop (Cloudera)
Cloud agnostic deployment (Apache Whirr)

Google Cloud Dataproc
In this section, we will learn how to use Google Cloud Dataproc to set up a single node Hadoop cluster. The steps can be broken down into the following:

Getting a Google Cloud account.
Activating the Google Cloud Dataproc service.
Creating a new Hadoop cluster.
Logging in to the Hadoop cluster.
Deleting the Hadoop cluster.

Getting a Google Cloud account
This section assumes that you already have a Google Cloud account.

Activating the Google Cloud Dataproc service
Once you log in to the Google Cloud console, you need to visit the Cloud Dataproc service. The activation screen looks something like this:

Creating a new Hadoop cluster
Once Dataproc is enabled in the project, we can click on Create to create a new Hadoop cluster. After this, we see another screen where we need to configure the cluster parameters. I have left most of the things at their default values. Later, we can click on the Create button, which creates a new cluster for us.

Logging in to the cluster
After the cluster has successfully been created, we will automatically be taken to the cluster lists page. From there, we can launch an SSH window to log in to the single node cluster we have created. The SSH window looks something like this:

As you can see, the Hadoop command is readily available for us, and we can run any of the standard Hadoop commands to interact with the system.

Deleting the cluster
In order to delete the cluster, click on the DELETE button and it will display a confirmation window, as shown in the following screenshot. After this, the cluster will be deleted.

Looks so simple, right? Yes. Cloud providers have made it very simple for users to use the Cloud and pay only for the usage.

Data access in the Cloud
The Cloud has become an important destination for storing both personal data and business data. Depending upon the importance and the secrecy requirements of the data, organizations have started using the Cloud to store their vital datasets. The following diagram tries to summarize the various access patterns of typical enterprises and how they leverage the Cloud to store their data:

Cloud providers offer different varieties of storage. Let's take a look at what these types are:

Block storage
File-based storage
Encrypted storage
Offline storage

Block storage
This type of storage is primarily useful when we want to use it along with our compute servers and manage the storage via the host operating system. To understand this better, this type of storage is equivalent to the hard disk/SSD that comes with our laptops/MacBooks when we purchase them. In the case of laptop storage, if we decide to increase the capacity, we need to replace the existing disk with another one. When it comes to the Cloud, if we want to add more capacity, we can just purchase another larger capacity storage device and attach it to our server.
Apart from performance reasons, there are other things to consider: Scaling Hadoop Managing Hadoop Securing Hadoop Now, let's try to understand how we can take care of these in the Cloud environment. Hadoop can be installed by the following methods: Standalone Semi-distributed Fully-distributed When we want to deploy Hadoop on the Cloud, we can deploy it using the following ways: Custom shell scripts Cloud automation tools (Chef, Ansible, and so on) Apache Ambari Cloud vendor provided methods Google Cloud Dataproc Amazon EMR Microsoft HDInsight Third-party managed Hadoop Cloudera Cloud agnostic deployment Apache Whirr Google Cloud Dataproc In this section, we will learn how to use Google Cloud Dataproc to set up a single node Hadoop cluster. The steps can be broken down into the following: Getting a Google Cloud account. Activating Google Cloud Dataproc service. Creating a new Hadoop cluster. Logging in to the Hadoop cluster. Deleting the Hadoop cluster. Getting a Google Cloud account This section assumes that you already have a Google Cloud account. Activating the Google Cloud Dataproc service Once you log in to the Google Cloud console, you need to visit the Cloud Dataproc service. The activation screen looks something like this: Creating a new Hadoop cluster Once the Dataproc is enabled in the project, we can click on Create to create a new Hadoop cluster. After this, we see another screen where we need to configure the cluster parameters: I have left most of the things to their default values. Later, we can click on the Create button which creates a new cluster for us. Logging in to the cluster After the cluster has successfully been created, we will automatically be taken to the cluster lists page. From there, we can launch an SSH window to log in to the single node cluster we have created. The SSH window looks something like this: As you can see, the Hadoop command is readily available for us and we can run any of the standard Hadoop commands to interact with the system. Deleting the cluster In order to delete the cluster, click on the DELETE button and it will display a confirmation window, as shown in the following screenshot. After this, the cluster will be deleted: Looks so simple, right? Yes. Cloud providers have made it very simple for users to use the Cloud and pay only for the usage. Data access in the Cloud The Cloud has become an important destination for storing both personal data and business data. Depending upon the importance and the secrecy requirements of the data, organizations have started using the Cloud to store their vital datasets. The following diagram tries to summarize the various access patterns of typical enterprises and how they leverage the Cloud to store their data: Cloud providers offer different varieties of storage. Let's take a look at what these types are: Block storage File-based storage Encrypted storage Offline storage Block storage This type of storage is primarily useful when we want to use this along with our compute servers, and want to manage the storage via the host operating system. To understand this better, this type of storage is equivalent to the hard disk/SSD that comes with our laptops/MacBook when we purchase them. In case of laptop storage, if we decide to increase the capacity, we need to replace the existing disk with another one. When it comes to the Cloud, if we want to add more capacity, we can just purchase another larger capacity storage and attach it to our server. 
This is one of the reasons why the Cloud has become popular as it has made it very easy to add or shrink the storage that we need. It's good to remember that, since there are many different types of access patterns for our applications, Cloud vendors also offer block storage with varying storage/speed requirements measured with their own capacity/IOPS, and so on. Let's take an example of this capacity upgrade requirement and see what we do to utilize this block storage on the Cloud. In order to understand this, let's look at the example in this diagram: Imagine a server created by the administrator called DB1 with an original capacity of 100 GB. Later, due to unexpected demand from customers, an application started consuming all the 100 GB of storage, so the administrator has decided to increase the capacity to 1 TB (1,024 GB). This is what the workflow looks like in this scenario: Create a new 1 TB disk on the Cloud Attach the disk to the server and mount it Take a backup of the database Copy the data from the existing disk to the new disk Start the database Verify the database Destroy the data on the old disk and return the disk This process is simplified but in production this might take some time, depending upon the type of maintenance that is being performed by the administrator. But, from the Cloud perspective, acquiring new block storage is very quick. File storage Files are the basics of computing. If you are familiar with UNIX/Linux environments, you already know that, everything is a file in the Unix world. But don't get confused with that as every operating system has its own way of dealing with hardware resources. In this case we are not worried about how the operating system deals with hardware resources, but we are talking about the important documents that the users store as part of their day-to-day business. These files can be: Movie/conference recordings Pictures Excel sheets Word documents Even though they are simple-looking files in our computer, they can have significant business importance and should be dealt with in a careful fashion, when we think of storing these on the Cloud. Most Cloud providers offer an easy way to store these simple files on the Cloud and also offer flexibility in terms of security as well. A typical workflow for acquiring the storage of this form is like this: Create a new storage bucket that's uniquely identified Add private/public visibility to this bucket Add multi-geography replication requirement to the data that is stored in this bucket Some Cloud providers bill their customers based on the number of features they select as part of their bucket creation. Please choose a hard-to-discover name for buckets that contain confidential data, and also make them private. Encrypted storage This is a very important requirement for business critical data as we do not want the information to be leaked outside the scope of the organization. Cloud providers offer an encryption at rest facility for us. Some vendors choose to do this automatically and some vendors also provide flexibility in letting us choose the encryption keys and methodology for the encrypting/decrypting data that we own. Depending upon the organization policy, we should follow best practices in dealing with this on the Cloud. With the increase in the performance of storage devices, encryption does not add significant overhead while decrypting/encrypting files. 
This is depicted in the following image:

Continuing the same example as before, when we choose to encrypt the underlying block storage of 1 TB, we can leverage the Cloud-offered encryption, where the provider automatically encrypts and decrypts the data for us. So, we do not have to employ special software on the host operating system to do the encryption and decryption. Remember that encryption can be a feature that's available in both the block storage and the file-based storage offered by the vendor.

Cold storage
This storage is very useful for storing important backups in the Cloud that are rarely accessed. Since we are dealing with a special type of data here, we should also be aware that the Cloud vendor might charge significantly higher amounts for data access from this storage, as it's meant to be written once and forgotten (until it's needed). The advantage of this mechanism is that we have to pay much less to store even petabytes of data.

We looked at the different steps involved in building our own Hadoop cluster on the Cloud, and we saw different ways of storing and accessing our data on the Cloud. To know more about how to build expert Big Data systems, do check out this book, Modern Big Data Processing with Hadoop.

Read More:
What makes Hadoop so revolutionary?
Machine learning APIs for Google Cloud Platform
Getting to know different Big data Characteristics

CNN architecture

Packt
13 Feb 2017
14 min read
In this article by Giancarlo Zaccone, the author of the book Deep Learning with TensorFlow, we learn about convolutional neural networks (CNNs). In a multi-layer network, the outputs of all neurons of the input layer would be connected to each neuron of the hidden layer (a fully connected layer). In CNNs, instead, the connection scheme that defines the convolutional layer, which we are going to describe, is significantly different. As you can guess, this is the main type of layer; the use of one or more of these layers in a convolutional neural network is indispensable.

In a convolutional layer, each neuron is connected to a certain region of the input area called the receptive field. For example, using a 3×3 kernel filter, each neuron will have a bias and 9=3×3 weights connected to a single receptive field. Of course, to effectively recognize an image we need different kernel filters applied to the same receptive field, because each filter should recognize a different image feature. The set of neurons that identify the same feature define a single feature map.

The preceding figure shows a CNN architecture in action: the input image of 28×28 size will be analyzed by a convolutional layer composed of 32 feature maps of 28×28 size. The figure also shows a receptive field and a kernel filter of 3×3 size.

Figure: CNN in action

A CNN may consist of several convolution layers connected in cascade. The output of each convolution layer is a set of feature maps (each generated by a single kernel filter); then all these matrices define a new input that will be used by the next layer. CNNs also use pooling layers positioned immediately after the convolutional layers. A pooling layer divides a convolutional region into subregions and selects a single representative value (max-pooling or average pooling) to reduce the computational time of subsequent layers and increase the robustness of the feature with respect to its spatial position. The last hidden layer of a convolutional network is generally a fully connected network with a softmax activation function for the output layer.

A model for CNNs - LeNet
Convolutional and max-pooling layers are at the heart of the LeNet family of models. It is a family of multi-layered feedforward networks specialized in visual pattern recognition. While the exact details of the model will vary greatly, the following figure points out the graphical schema of a LeNet network:

Figure: LeNet network

In a LeNet model, the lower layers are composed of alternating convolution and max-pooling, while the last layers are fully connected and correspond to a traditional feedforward network (fully connected + softmax layer). The input to the first fully-connected layer is the set of all feature maps at the layer below. From a TensorFlow implementation point of view, this means the lower layers operate on 4D tensors. These are then flattened to a 2D matrix to be compatible with a feedforward implementation.

Build your first CNN
In this section, we will learn how to build a CNN to classify images of the MNIST dataset. We will see that a simple softmax model provides about 92% classification accuracy for recognizing hand-written digits in MNIST. Here we'll implement a CNN which has a classification accuracy of about 99%.

The next figure shows how the data flows in the first two convolutional layers: the input image is processed in the first convolutional layer using the filter-weights. This results in 32 new images, one for each filter in the convolutional layer.
The images are also downsampled with the pooling operation, so the image resolution is decreased from 28×28 to 14×14. These 32 smaller images are then processed in the second convolutional layer. We need filter-weights again for each of these 32 input channels, and we need filter-weights for each output channel of this layer. The images are again downsampled with a pooling operation so that the image resolution is decreased from 14×14 to 7×7. The total number of features for this convolutional layer is 64.

Figure: Data flow of the first two convolutional layers

The 64 resulting images are filtered again by a (3×3) third convolutional layer, whose output is 128 images of 7×7 pixels each. A final max-pooling step (applied just before flattening, as shown in the code below) reduces these to 4×4, and they are then flattened to a single vector of length 4×4×128, which is used as the input to a fully-connected layer with 625 neurons (or elements). This feeds into another fully-connected layer with 10 neurons, one for each of the classes, which is used to determine the class of the image, that is, which number is depicted in the following image:

Figure: Data flow of the last three convolutional layers

The convolutional filters are initially chosen at random. The error between the predicted and actual class of the input image is measured by the so-called cost function, which generalizes our network beyond the training data. The optimizer then automatically propagates this error back through the convolutional network and updates the filter-weights to improve the classification error. This is done iteratively thousands of times until the classification error is sufficiently low.

Now let's see in detail how to code our first CNN. Let's start by importing the TensorFlow libraries for our implementation:

import tensorflow as tf
import numpy as np
from tensorflow.examples.tutorials.mnist import input_data

Set the following parameters, which indicate the number of samples to consider for the training phase (128) and the test phase (256), respectively:

batch_size = 128
test_size = 256

We define the following parameter; the value is 28 because a MNIST image is 28 pixels in height and width:

img_size = 28

And the number of classes; the value 10 means that we'll have one class for each of the 10 digits:

num_classes = 10

A placeholder variable, X, is defined for the input images. The data type for this tensor is set to float32 and the shape is set to [None, img_size, img_size, 1], where None means that the tensor may hold an arbitrary number of images:

X = tf.placeholder("float", [None, img_size, img_size, 1])

Then we set another placeholder variable, Y, for the true labels associated with the images that were input data in the placeholder variable X. The shape of this placeholder variable is [None, num_classes], which means it may hold an arbitrary number of labels, and each label is a vector of length num_classes, which is 10 in this case:

Y = tf.placeholder("float", [None, num_classes])

We collect the MNIST data (using the input_data module imported above, with one-hot encoded labels so they match the shape of Y), which will be copied into the data folder:

mnist = input_data.read_data_sets("data/", one_hot=True)

We build the datasets for training (trX, trY) and testing the network (teX, teY):

trX, trY, teX, teY = mnist.train.images, mnist.train.labels, mnist.test.images, mnist.test.labels

The trX and teX image sets must be reshaped according to the input shape:

trX = trX.reshape(-1, img_size, img_size, 1)
teX = teX.reshape(-1, img_size, img_size, 1)

We shall now proceed to define the network's weights.
The init_weights function builds new variables in the given shape and initializes network's weights with random values. def init_weights(shape): return tf.Variable(tf.random_normal(shape, stddev=0.01)) Each neuron of the first convolutional layer is convoluted to a small subset of the input tensor, of dimension 3×3×1, while the value 32 is just the number of feature map we are considering for this first layer. The weight w is then defined: w = init_weights([3, 3, 1, 32]) The number of inputs is then increased of 32, this means that each neuron of the second convolutional layer is convoluted to 3x3x32 neurons of the first convolution layer. The w2 weight is: w2 = init_weights([3, 3, 32, 64]) The value 64 represents the number of obtained output feature. The third convolutional layer is convoluted to 3x3x64 neurons of the previous layer, while 128 are the resulting features: w3 = init_weights([3, 3, 64, 128]) The fourth layer is fully connected, it receives 128x4x4 inputs, while the output is equal to 625: w4 = init_weights([128 * 4 * 4, 625]) The output layer receives625inputs, while the output is the number of classes: w_o = init_weights([625, num_classes]) Note that these initializations are not actually done at this point; they are merely being defined in the TensorFlow graph. p_keep_conv = tf.placeholder("float") p_keep_hidden = tf.placeholder("float") It's time to define the network model; as we did for the network's weights definition it will be a function. It receives as input, the X tensor, the weights tensors, and the dropout parameters for convolution and fully connected layer: def model(X, w, w2, w3, w4, w_o, p_keep_conv, p_keep_hidden): The tf.nn.conv2d() function executes the TensorFlow operation for convolution, note that the strides are set to 1 in all dimensions. Indeed, the first and last stride must always be 1, because the first is for the image-number and the last is for the input-channel. The padding parameter is set to 'SAME' which means the input image is padded with zeroes so the size of the output is the same: conv1 = tf.nn.conv2d(X, w,strides=[1, 1, 1, 1], padding='SAME') Then we pass the conv1 layer to a relu layer. It calculates the max(x, 0) funtion for each input pixel x, adding some non-linearity to the formula and allows us to learn more complicated functions: conv1 = tf.nn.relu(conv1) The resulting layer is then pooled by the tf.nn.max_pool operator: conv1 = tf.nn.max_pool(conv1, ksize=[1, 2, 2, 1] ,strides=[1, 2, 2, 1], padding='SAME') It is a 2×2 max-pooling, which means that we are considering 2×2 windows and select the largest value in each window. Then we move 2 pixels to the next window. We try to reduce the overfitting, via the tf.nn.dropout() function, passing the conv1layer and the p_keep_convprobability value: conv1 = tf.nn.dropout(conv1, p_keep_conv) As you can note the next two convolutional layers, conv2, conv3, are defined in the same way as conv1:   conv2 = tf.nn.conv2d(conv1, w2, strides=[1, 1, 1, 1], padding='SAME') conv2 = tf.nn.relu(conv2) conv2 = tf.nn.max_pool(conv2, ksize=[1, 2, 2, 1], strides=[1, 2, 2, 1], padding='SAME') conv2 = tf.nn.dropout(conv2, p_keep_conv) conv3=tf.nn.conv2d(conv2, w3, strides=[1, 1, 1, 1] ,padding='SAME') conv3_a = tf.nn.relu(conv3) Two fully-connected layers are added to the network. 
The input of the first FC_layer is the convolution layer from the previous convolution: FC_layer = tf.nn.max_pool(conv3, ksize=[1, 2, 2, 1], strides=[1, 2, 2, 1], padding='SAME') FC_layer = tf.reshape(FC_layer, [-1,w4.get_shape().as_list()[0]]) A dropout function is again used to reduce the overfitting: FC_layer = tf.nn.dropout(FC_layer, p_keep_conv) The output layer receives the input as FC_layer and the w4 weight tensor. A relu and a dropout operator are respectively applied: output_layer = tf.nn.relu(tf.matmul(FC_layer, w4)) output_layer = tf.nn.dropout(output_layer, p_keep_hidden) The result is a vector of length 10 for determining which one of the 10 classes for the input image belongs to: result = tf.matmul(output_layer, w_o) return result The cross-entropy is the performance measure we used in this classifier. The cross-entropy is a continuous function that is always positive and is equal to zero, if the predicted output exactly matches the desired output. The goal of this optimization is therefore to minimize the cross-entropy so it gets, as close to zero as possible, by changing the variables of the network layers. TensorFlow has a built-in function for calculating the cross-entropy. Note that the function calculates the softmax internally so we must use the output of py_x directly: py_x = model(X, w, w2, w3, w4, w_o, p_keep_conv, p_keep_hidden) Y_ = tf.nn.softmax_cross_entropy_with_logits(py_x, Y) Now that we have defined the cross-entropy for each classified image, we have a measure of how well the model performs on each image individually. But using the cross-entropy to guide the optimization of the networks's variables we need a single scalar value, so we simply take the average of the cross-entropy for all the classified images: cost = tf.reduce_mean(Y_) To minimize the evaluated cost, we must define an optimizer. In this case, we adopt the implemented RMSPropOptimizer which is an advanced form of gradient descent. RMSPropOptimizer implements the RMSProp algorithm, that is an unpublished, adaptive learning rate method proposed by Geoff Hinton in Lecture 6e of his Coursera class. You find George Hinton's course in https://www.coursera.org/learn/neural-networks RMSPropOptimizeras well divides the learning rate by an exponentially decaying average of squared gradients. Hinton suggests setting the decay parameter to 0.9, while a good default value for the learning rate is 0.001. optimizer = tf.train.RMSPropOptimizer(0.001, 0.9).minimize(cost) Basically, the common Stochastic Gradient Descent (SGD) algorithm has a problem in that learning rates must scale with 1/T to get convergence, where T is the iteration number. RMSProp tries to get around this by automatically adjusting the step size so that the step is on the same scale as the gradients as the average gradient gets smaller, the coefficient in the SGD update gets bigger to compensate. An interesting reference about this algorithm can be found here: http://www.cs.toronto.edu/%7Etijmen/csc321/slides/lecture_slides_lec6.pdf Finally, we define predict_op that is the index with the largest value across dimensions from the output of the mode: predict_op = tf.argmax(py_x, 1) Note that optimization is not performed at this point. Nothing is calculated at all; we'll just add the optimizer object to the TensorFlow graph for later execution. We now come to define the network's running session, there are 55,000 images in the training set, so it takes a long time to calculate the gradient of the model using all these images. 
Therefore we'll use a small batch of images in each iteration of the optimizer. If your computer crashes or becomes very slow because you run out of RAM, then you may try and lower this number, but you may then need to perform more optimization iterations. Now we can proceed to implement a TensorFlow session: with tf.Session() as sess: tf.initialize_all_variables().run() for i in range(100): We get a batch of training examples, the tensor training_batch now holds a subset of images and corresponding labels: training_batch = zip(range(0, len(trX), batch_size), range(batch_size, len(trX)+1, batch_size)) Put the batch into feed_dict with the proper names for placeholder variables in the graph. We run the optimizer using this batch of training data, TensorFlow assigns the variables in feed to the placeholder variables and then runs the optimizer: for start, end in training_batch: sess.run(optimizer, feed_dict={X: trX[start:end], Y: trY[start:end], p_keep_conv: 0.8, p_keep_hidden: 0.5}) At the same time we get a shuffled batch of test samples: test_indices = np.arange(len(teX)) np.random.shuffle(test_indices) test_indices = test_indices[0:test_size] For each iteration we display the accuracy evaluated on the batch set: print(i, np.mean(np.argmax(teY[test_indices], axis=1) == sess.run (predict_op, feed_dict={X: teX[test_indices], Y: teY[test_indices], p_keep_conv: 1.0, p_keep_hidden: 1.0}))) Training a network can take several hours depending on the used computational resources. The results on my machine is as follows: Successfully downloaded train-images-idx3-ubyte.gz 9912422 bytes. Successfully extracted to train-images-idx3-ubyte.mnist 9912422 bytes. Loading ata/train-images-idx3-ubyte.mnist Successfully downloaded train-labels-idx1-ubyte.gz 28881 bytes. Successfully extracted to train-labels-idx1-ubyte.mnist 28881 bytes. Loading ata/train-labels-idx1-ubyte.mnist Successfully downloaded t10k-images-idx3-ubyte.gz 1648877 bytes. Successfully extracted to t10k-images-idx3-ubyte.mnist 1648877 bytes. Loading ata/t10k-images-idx3-ubyte.mnist Successfully downloaded t10k-labels-idx1-ubyte.gz 4542 bytes. Successfully extracted to t10k-labels-idx1-ubyte.mnist 4542 bytes. Loading ata/t10k-labels-idx1-ubyte.mnist (0, 0.95703125) (1, 0.98046875) (2, 0.9921875) (3, 0.99609375) (4, 0.99609375) (5, 0.98828125) (6, 0.99609375) (7, 0.99609375) (8, 0.98828125) (9, 0.98046875) (10, 0.99609375) . . . .. . (90, 1.0) (91, 0.9921875) (92, 0.9921875) (93, 0.99609375) (94, 1.0) (95, 0.98828125) (96, 0.98828125) (97, 0.99609375) (98, 1.0) (99, 0.99609375) After 10,000 iterations, the model has an accuracy of about 99%....no bad!! Summary In this article, we introduced Convolutional Neural Networks (CNNs). We have seen how the architecture of these networks, yield CNNs, particularly suitable for image classification problems, making faster the training phase and more accurate the test phase. We have therefore implemented an image classifier, testing it on MNIST data set, where have achieved a 99% accuracy. Finally, we built a CNN to classify emotions starting from a dataset of images; we tested the network on a single image and we evaluated the limits and the goodness of our model. Resources for Article: Further resources on this subject: Getting Started with Deep Learning [article] Practical Applications of Deep Learning [article] Deep learning in R [article]