
How-To Tutorials - Data

1210 Articles

LabVIEW Basics

Packt
02 Nov 2016
8 min read
In this article by Behzad Ehsani, author of the book Data Acquisition using LabVIEW, after a brief introduction and a short note on installation, we will go over the most widely used palettes and objects on the icon toolbar of a standard LabVIEW installation, with a brief explanation of what each object does. (For more resources related to this topic, see here.)

Introduction to LabVIEW

LabVIEW is a graphical development and testing environment unlike any other test and development tool available in the industry. LabVIEW sets itself apart from traditional programming environments by its completely graphical approach to programming. As an example, while the representation of a while loop in a text-based language such as C consists of several predefined, extremely compact, and sometimes cryptic lines of text, a while loop in LabVIEW is actually a graphical loop. The environment is extremely intuitive and powerful, which makes for a short learning curve for the beginner.

LabVIEW is based on what is called the G language, but there are still other languages, especially C, under the hood. However, the ease of use and power of LabVIEW can be somewhat deceiving to a novice user. Many people have attempted to start projects in LabVIEW only because, at first glance, the graphical nature of the interface and the drag-and-drop concept used in LabVIEW appear to do away with the required basics of programming concepts and a classical education in programming science and engineering. This is far from the reality of using LabVIEW as the predominant development environment. While it is true that in many higher-level development and testing environments (especially when using complicated test equipment, performing complex mathematical calculations, or even creating embedded software) LabVIEW's approach will be much more time-efficient and less bug-prone than the many lines of code required in a traditional text-based programming environment, one must be aware of LabVIEW's strengths and possible weaknesses. LabVIEW does not completely replace the need for traditional text-based languages, and depending on the nature of a project, either LabVIEW or a traditional text-based language such as C may be the more suitable programming or test environment.

Installing LabVIEW

Installation of LabVIEW is very simple and just as routine as any modern program installation: insert DVD 1 and follow the on-screen guided installation steps. LabVIEW comes on one DVD for the Mac and Linux versions, but on four or more DVDs for the Windows edition (depending on additional software, different licensing, and additional libraries and packages purchased). In this article we will use the LabVIEW 2013 Professional Development version for Windows. Given the target audience of this article, we assume the user is quite capable of installing the program. Installation is also well documented by National Instruments, and the mandatory one-year support purchase with each copy of LabVIEW is a valuable source of live and email help. The NI website (www.ni.com) also hosts many user support groups, which are a great source of support, example code, discussion groups, local group events, and meetings of fellow LabVIEW developers.

One worthy note for those who are new to installing LabVIEW is that the installation DVDs include much more than what an average user would need and pay for. We strongly suggest that you install the additional software (beyond what has been purchased and licensed or is of immediate need!).
This additional software is fully functional (in demo mode for 7 days), and the demo period may be extended for about a month with online registration. This is a very good opportunity to get hands-on experience with even more of the power and functionality that LabVIEW has to offer. The additional knowledge gained by installing the other software available on the DVDs may help in the further development of a given project. Just imagine: if the current development of a robot only encompasses mechanical movements and sensors today, optical recognition is probably going to follow sooner than one may think. If data acquisition using expensive hardware and software is possible in one location, the need for web sharing and remote control of the setup is just around the corner. It is very helpful to at least be aware of what packages are currently available and to be able to install and test them prior to a full purchase and implementation. The following screenshot shows what may be installed if almost all software on all DVDs is selected:

When installing a fresh version of LabVIEW, if you do decide to follow the advice above, make sure to click on the + sign next to each package you decide to install and prevent the installation of LabWindows/CVI... and Measurement Studio... for Visual Studio. LabWindows, according to National Instruments, is an ANSI C integrated development environment. Also note that, by default, the NI device drivers are not selected for installation. Device drivers are an essential part of any data acquisition, and the appropriate drivers for communications and instrument control must be installed before LabVIEW can interact with external equipment. Note, too, that the device drivers (on Windows installations) come on a separate DVD, which means that you do not have to install them at the same time as the main application and other modules; they can be installed at any later time. Almost all well-established vendors package their products with LabVIEW drivers and example code. If a driver is not readily available, National Instruments has programmers who can write one, but this comes at a cost to the user.

The VI Package Manager, now installed as part of the standard installation, is also a must these days. National Instruments distributes third-party software, drivers, and public-domain packages via the VI Package Manager, and appropriate software and drivers for supported microcontrollers are installed through it as well. You can install many public-domain packages that add further useful toolkits to a LabVIEW installation, and these can be used just like those that are delivered professionally by National Instruments.

Finally, note that the more modules, packages, and software you select, the longer the installation will take. This may sound like an obvious point, but surprisingly enough, installing all the software on the three DVDs (for Windows) took over five hours on the standard laptop or PC we used. Obviously, a more powerful PC (such as one with a solid state drive) may not take such a long time.

LabVIEW Basics

Once the LabVIEW application is launched, by default two blank windows open simultaneously, a Front Panel and a Block Diagram window, and a VI is created. VIs, or Virtual Instruments, are the heart and soul of LabVIEW. They are what separate LabVIEW from all other text-based development environments. In LabVIEW, everything is an object which is represented graphically.
A VI may consist of only a few objects or of hundreds of objects embedded in many subVIs. Everything, from a simple while loop to a complex mathematical concept such as polynomial interpolation or simply a Boolean constant, is represented graphically. To use an object, right-click inside the Block Diagram or Front Panel window and a palette list appears; follow the arrow, pick an object from the subsequent palette, and place it on the appropriate window. The selected object can now be dragged to a different location on that window and is ready to be wired. Depending on what kind of object is selected, a graphical representation of the object appears on both windows. Of course, there are many exceptions to this rule. For example, a while loop can only be selected in the Block Diagram and, by itself, a while loop does not have a graphical representation on the Front Panel window. Needless to say, LabVIEW also has keyboard combinations that expedite selecting and placing any given toolkit object onto the appropriate window.

Each object has one or several wire connections going in as its input(s) and coming out as its output(s). A VI becomes functional when a minimum number of wires are appropriately connected to the inputs and outputs of one or more objects. Later, we will use an example to illustrate how a basic LabVIEW VI is created and executed.

Highlights

LabVIEW is a complete object-oriented development and test environment based on the G language. As such, it is a very powerful and complex environment. In this article we went through an introduction to LabVIEW and the main functionality of each of its icons by way of an actual interactive example. Accompanied by appropriate hardware (both NI products as well as many industry-standard test, measurement, and development hardware products), LabVIEW is capable of covering everything from developing embedded systems to fuzzy logic and almost everything in between!

Summary

In this article we covered the basics of LabVIEW, from installation to an in-depth explanation of each element in the toolbar.

Resources for Article: Further resources on this subject: Python Data Analysis Utilities [article] Data mining [article] PostgreSQL in Action [article]


Jupyter and Python Scripting

Packt
21 Oct 2016
9 min read
In this article by Dan Toomey, author of the book Learning Jupyter, we will look at data access in Jupyter with Python and the effect of pandas on Jupyter. We will also look at Python graphics and, lastly, Python random numbers. (For more resources related to this topic, see here.)

Python data access in Jupyter

I started a new view for pandas, using Python Data Access as the name. We will read in a large dataset and compute some standard statistics on the data. We are interested in seeing how we use pandas in Jupyter, how well the script performs, and what information is stored in the metadata (especially if it is a larger dataset). Our script accesses the iris dataset that is built into one of the Python packages. All we are looking to do is read in a slightly large number of items and calculate some basic operations on the dataset. We are really interested in seeing how much of the data is cached in the IPYNB file. The Python code is:

# import the datasets package
from sklearn import datasets
# pull in the iris data
iris_dataset = datasets.load_iris()
# grab the first two columns of data
X = iris_dataset.data[:, :2]
# calculate some basic statistics
x_count = len(X.flat)
x_min = X[:, 0].min() - .5
x_max = X[:, 0].max() + .5
x_mean = X[:, 0].mean()
# display our results
x_count, x_min, x_max, x_mean

I broke these steps into a couple of cells in Jupyter, as shown in the following screenshot. Now, run the cells (using Cell | Run All) and you get the display below. The only difference is the last Out line, where our values are displayed. It seemed to take longer to load the library (the first time I ran the script) than to read the data and calculate the statistics.

If we look in the IPYNB file for this notebook, we see that none of the data is cached there. We simply have code references to the library, our code, and the output from when we last ran the script:

{ "cell_type": "code", "execution_count": 4, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "(300, 3.7999999999999998, 8.4000000000000004, 5.8433333333333337)" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# calculate some basic statistics\n", "x_count = len(X.flat)\n", "x_min = X[:, 0].min() - .5\n", "x_max = X[:, 0].max() + .5\n", "x_mean = X[:, 0].mean()\n", "\n", "# display our results\n", "x_count, x_min, x_max, x_mean" ] }

Python pandas in Jupyter

One of the most widely used Python packages is pandas, an open source library of data analysis tools that can be used freely. In this example, we will develop a Python script that uses pandas to see if there is any effect of using it in Jupyter. I am using the Titanic dataset from http://www.kaggle.com/c/titanic-gettingStarted/download/train.csv. I am sure the same data is available from a variety of sources. Here is the Python script that we want to run in Jupyter:

from pandas import *
training_set = read_csv('train.csv')
training_set.head()
male = training_set[training_set.sex == 'male']
female = training_set[training_set.sex =='female']
womens_survival_rate = float(sum(female.survived))/len(female)
mens_survival_rate = float(sum(male.survived))/len(male)

The script calculates the survival rates of the passengers based on sex. We create a new notebook, enter the script into the appropriate cells, add displays of the calculated data at each point, and produce our results.
Here is our notebook laid out with displays of the calculated data added at each cell, as shown in the following screenshot. When I ran this script, I had two problems:

On Windows, it is common to use the backslash ("\") to separate parts of a filename. However, this coding uses the backslash as a special character, so I had to switch to forward slashes ("/") in my CSV file path. I originally had a full path to the CSV in the above code example.

The dataset column names are taken directly from the file and are case sensitive. In this case, I was originally using the 'sex' field in my script, but in the CSV file the column is named Sex. Similarly, I had to change survived to Survived.

The final script and result look like the following screenshot when we run it. I have used the head() function to display the first few lines of the dataset. It is interesting how much detail is available for all of the passengers. If you scroll down, you see the results as shown in the following screenshot: 74% of the women survived versus just 19% of the men. I would like to think chivalry is not dead! Curiously, the results do not total 100%; however, like every other dataset I have seen, there is missing and/or inaccurate data present.

Python graphics in Jupyter

How do Python graphics work in Jupyter? I started another view for this, named Python Graphics, so as to distinguish the work. If we were to build a sample dataset of baby names and the number of births in a year for each name, we could then plot the data. The Python coding is simple:

import pandas
import matplotlib
%matplotlib inline
baby_name = ['Alice','Charles','Diane','Edward']
number_births = [96, 155, 66, 272]
dataset = list(zip(baby_name,number_births))
df = pandas.DataFrame(data = dataset, columns=['Name', 'Number'])
df['Number'].plot()

The steps of the script are as follows: we import the graphics library (and data library) that we need, define our data, convert the data into a format that allows for easy graphical display, and plot the data. We would expect a resultant graph of the number of births by baby name. Taking the above script and placing it into cells of our Jupyter notebook, we get something that looks like the following screenshot. I have broken the script into different cells for easier readability. Having different cells also allows you to develop the script easily step by step, where you can display the values computed so far to validate your results. I have done this in most of the cells by displaying the dataset and DataFrame at the bottom of those cells. When we run this script (Cell | Run All), we see the results at each step displayed as the script progresses, and finally we see our plot of the births as shown in the following screenshot.

I was curious what metadata was stored for this script. Looking into the IPYNB file, you can see the expected values for the formula cells.
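If you want to poke at this metadata yourself, a notebook file is just JSON, so it can be inspected with a few lines of standard-library Python. The following is only a minimal sketch, and the filename Python Graphics.ipynb is an assumption based on the view name used above:

```python
import json

# A .ipynb file is plain JSON (the filename here is assumed).
with open("Python Graphics.ipynb", "r", encoding="utf-8") as f:
    notebook = json.load(f)

# Walk the cells and report what kind of output each code cell stored.
for index, cell in enumerate(notebook.get("cells", [])):
    if cell.get("cell_type") != "code":
        continue
    for output in cell.get("outputs", []):
        # Outputs carry a "data" dict keyed by MIME type, for example
        # text/plain, text/html for DataFrame tables, image/png for plots.
        mime_types = list(output.get("data", {}).keys())
        print(index, output.get("output_type"), mime_types)
```

Run against the notebook above, this should report a text/html entry for the DataFrame display cell and an image/png entry for the plot cell, matching the raw JSON shown next.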
The tabular data display of the DataFrame is stored as HTML, which is convenient:

{ "cell_type": "code", "execution_count": 43, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "<div>\n", "<table border=\"1\" class=\"dataframe\">\n", "<thead>\n", "<tr style=\"text-align: right;\">\n", "<th></th>\n", "<th>Name</th>\n", "<th>Number</th>\n", "</tr>\n", "</thead>\n", "<tbody>\n", "<tr>\n", "<th>0</th>\n", "<td>Alice</td>\n", "<td>96</td>\n", "</tr>\n", "<tr>\n", "<th>1</th>\n", "<td>Charles</td>\n", "<td>155</td>\n", "</tr>\n", "<tr>\n", "<th>2</th>\n", "<td>Diane</td>\n", "<td>66</td>\n", "</tr>\n", "<tr>\n", "<th>3</th>\n", "<td>Edward</td>\n", "<td>272</td>\n", "</tr>\n", "</tbody>\n", "</table>\n", "</div>" ], "text/plain": [ " Name Number\n", "0 Alice 96\n", "1 Charles 155\n", "2 Diane 66\n", "3 Edward 272" ] }, "execution_count": 43, "metadata": {}, "output_type": "execute_result" } ],

The graphic output cell is stored like this:

{ "cell_type": "code", "execution_count": 27, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "<matplotlib.axes._subplots.AxesSubplot at 0x47cf8f0>" ] }, "execution_count": 27, "metadata": {}, "output_type": "execute_result" }, { "data": { "image/png": "<a few hundred lines of hexcodes> …/wc/B0RRYEH0EQAAAABJRU5ErkJggg==\n", "text/plain": [ "<matplotlib.figure.Figure at 0x47d8e30>" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "# plot the data\n", "df['Number'].plot()\n" ] } ],

Here, the image/png entry contains a large base64-encoded string representation of the graphical image displayed on screen (I abbreviated the display in the listing shown). So the actual generated image is stored in the metadata for the page.

Python random numbers in Jupyter

For many analyses we are interested in calculating repeatable results. However, much of the analysis relies on random numbers being used. In Python, you can set the seed for the random number generator to achieve repeatable results with the random.seed() function. In this example, we simulate rolling a pair of dice and looking at the outcome. We would expect the average total of the two dice to be 7, the halfway point between the lowest and highest possible totals. The script we are using is this:

import pylab
import random

random.seed(113)
samples = 1000
dice = []
for i in range(samples):
    total = random.randint(1,6) + random.randint(1,6)
    dice.append(total)
pylab.hist(dice, bins= pylab.arange(1.5,12.6,1.0))
pylab.show()

Once we have the script in Jupyter and execute it, we have this result. I added some more statistics; I am not sure I would have counted on such a high standard deviation. If we increased the number of samples, this would decrease. The resulting graph opened in a new window, much as it would if you ran this script in another Python development environment. The toolbar at the top of the graphic is extensive, allowing you to manipulate the graphic in many ways.

Summary

In this article, we walked through simple data access in Jupyter through Python. Then we saw an example of using pandas. We looked at a graphics example. Finally, we looked at an example using random numbers in a Python script.

Resources for Article: Further resources on this subject: Python Data Science Up and Running [article] Mining Twitter with Python – Influence and Engagement [article] Unsupervised Learning [article]


The Data Science Venn Diagram

Packt
21 Oct 2016
15 min read
It is a common misconception that only those with a PhD or geniuses can understand the math/programming behind data science. This is absolutely false. In this article by Sinan Ozdemir, author of the book Principles of Data Science, we will discuss how data science begins with three basic areas: Math/statistics: This is the use of equations and formulas to perform analysis Computer programming: This is the ability to use code to create outcomes on the computer Domain knowledge: This refers to understanding the problem domain (medicine, finance, social science, and so on) (For more resources related to this topic, see here.) The following Venn diagram provides a visual representation of how the three areas of data science intersect: The Venn diagram of data science Those with hacking skills can conceptualize and program complicated algorithms using computer languages. Having a math and statistics knowledge base allows you to theorize and evaluate algorithms and tweak the existing procedures to fit specific situations. Having substantive (domain) expertise allows you to apply concepts and results in a meaningful and effective way. While having only two of these three qualities can make you intelligent, it will also leave a gap. Consider that you are very skilled in coding and have formal training in day trading. You might create an automated system to trade in your place but lack the math skills to evaluate your algorithms and, therefore, end up losing money in the long run. It is only when you can boast skills in coding, math, and domain knowledge, can you truly perform data science. The one that was probably a surprise for you was domain knowledge. It is really just knowledge of the area you are working in. If a financial analyst started analyzing data about heart attacks, they might need the help of a cardiologist to make sense of a lot of the numbers. Data science is the intersection of the three key areas mentioned earlier. In order to gain knowledge from data, we must be able to utilize computer programming to access the data, understand the mathematics behind the models we derive, and above all, understand our analyses' place in the domain we are in. This includes presentation of data. If we are creating a model to predict heart attacks in patients, is it better to create a PDF of information or an app where you can type in numbers and get a quick prediction? All these decisions must be made by the data scientist. Also, note that the intersection of math and coding is machine learning, but it is important to note that without the explicit ability to generalize any models or results to a domain, machine learning algorithms remain just as algorithms sitting on your computer. You might have the best algorithm to predict cancer. You could be able to predict cancer with over 99% accuracy based on past cancer patient data but if you don't understand how to apply this model in a practical sense such that doctors and nurses can easily use it, your model might be useless. Domain knowledge comes with both practice of data science and reading examples of other people's analyses. The math Most people stop listening once someone says the word "math". They'll nod along in an attempt to hide their utter disdain for the topic. We will use these subdomains of mathematics to create what are called models. A data model refers to an organized and formal relationship between elements of data, usually meant to simulate a real-world phenomenon. 
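To make that concrete, here is a minimal sketch of what "formalizing a relationship with math" can look like in code: fitting a straight line to a handful of made-up (x, y) pairs with NumPy. The numbers are invented purely for illustration and are not the book's salmon data.

```python
import numpy as np

# Invented example data: a rough positive relationship between two variables.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

# Fit a degree-1 polynomial (a straight line y = m*x + b) to the points.
m, b = np.polyfit(x, y, 1)
print("slope:", m, "intercept:", b)

# The fitted model lets us plug in one variable to estimate the other.
new_x = 6.0
print("predicted y for x = 6:", m * new_x + b)
```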
Essentially, we will use math in order to formalize relationships between variables. As a former pure mathematician and current math teacher, I know how difficult this can be. I will do my best to explain everything as clearly as I can. Between the three areas of data science, math is what allows us to move from domain to domain. Understanding theory allows us to apply a model that we built for the fashion industry to a financial model. Every mathematical concept I introduce, I do so with care, examples, and purpose. The math in this article is essential for data scientists. Example – Spawner-Recruit Models In biology, we use, among many others, a model known as the Spawner-Recruit model to judge the biological health of a species. It is a basic relationship between the number of healthy parental units of a species and the number of new units in the group of animals. In a public dataset of the number of salmon spawners and recruits, the following graph was formed to visualize the relationship between the two. We can see that there definitely is some sort of positive relationship (as one goes up, so does the other). But how can we formalize this relationship? For example, if we knew the number of spawners in a population, could we predict the number of recruits that group would obtain and vice versa? Essentially, models allow us to plug in one variable to get the other. Consider the following example: In this example, let's say we knew that a group of salmons had 1.15 (in thousands) of spawners. Then, we would have the following: This result can be very beneficial to estimate how the health of a population is changing. If we can create these models, we can visually observe how the relationship between the two variables can change. There are many types of data models, including probabilistic and statistical models. Both of these are subsets of a larger paradigm, called machine learning. The essential idea behind these three topics is that we use data in order to come up with the "best" model possible. We no longer rely on human instincts, rather, we rely on data. Spawner-Recruit model visualized The purpose of this example is to show how we can define relationships between data elements using mathematical equations. The fact that I used salmon health data was irrelevant! The main reason for this is that I would like you (the reader) to be exposed to as many domains as possible. Math and coding are vehicles that allow data scientists to step back and apply their skills virtually anywhere. Computer programming Let's be honest. You probably think computer science is way cooler than math. That's ok, I don't blame you. The news isn't filled with math news like it is with news on the technological front. You don't turn on the TV to see a new theory on primes, rather you will see investigative reports on how the latest smartphone can take photos of cats better or something. Computer languages are how we communicate with the machine and tell it to do our bidding. A computer speaks many languages and, like a book, can be written in many languages; similarly, data science can also be done in many languages. Python, Julia, and R are some of the many languages available to us. This article will focus exclusively on using Python. Why Python? We will use Python for a variety of reasons: Python is an extremely simple language to read and write even if you've coded before, which will make future examples easy to ingest and read later. 
It is one of the most common languages both in production and in the academic setting (one of the fastest growing, as a matter of fact). The online community of the language is vast and friendly. This means that a quick Google search should yield multiple results of people who have faced and solved similar (if not exact) situations. Python has prebuilt data science modules that both the novice and the veteran data scientist can utilize.

The last is probably the biggest reason we will focus on Python. These prebuilt modules are not only powerful but also easy to pick up. Some of these modules are as follows: pandas, scikit-learn, seaborn, numpy/scipy, requests (to mine data from the web), and BeautifulSoup (for web HTML parsing).

Python practices

Before we move on, it is important to formalize many of the requisite coding skills in Python. In Python, we have variables that are placeholders for objects. We will focus on only a few types of basic objects at first:

int (an integer). Examples: 3, 6, 99, -34, 34, 11111111
float (a decimal). Examples: 3.14159, 2.71, -0.34567
boolean (either true or false). The statement "Sunday is a weekend" is true; the statement "Friday is a weekend" is false; the statement "pi is exactly the ratio of a circle's circumference to its diameter" is true (crazy, right?)
string (text or words made up of characters). "I love hamburgers" (by the way, who doesn't?) and "Matt is awesome" are strings, and a tweet is a string
list (a collection of objects). Example: [1, 5.4, True, "apple"]

We will also have to understand some basic logical operators. For these operators, keep the boolean type in mind; every operator evaluates to either true or false:

== evaluates to true if both sides are equal, otherwise it evaluates to false: 3 + 4 == 7 (true), 3 - 2 == 7 (false)
< (less than): 3 < 5 (true), 5 < 3 (false)
<= (less than or equal to): 3 <= 3 (true), 5 <= 3 (false)
> (greater than): 3 > 5 (false), 5 > 3 (true)
>= (greater than or equal to): 3 >= 3 (true), 5 >= 3 (false)

When coding in Python, I will use a pound sign (#) to create a comment, which will not be processed as code but is merely there to communicate with the reader. Anything to the right of a # is a comment on the code being executed.

Example of basic Python

In Python, we use spaces/tabs to denote operations that belong to other lines of code. Note the use of the if statement. It means exactly what you think it means: when the statement after the if statement is true, then the tabbed part under it will be executed, as shown in the following code:

x = 5.8
y = 9.5
x + y == 15.3 # This is True!
x - y == 15.3 # This is False!
if x + y == 15.3: # If the statement is true:
    print "True!" # print something!

The print "True!" belongs to the if x + y == 15.3: line preceding it because it is tabbed right under it. This means that the print statement will be executed if and only if x + y equals 15.3.

Note that the following list variable, my_list, can hold multiple types of objects. This one has an int, a float, a boolean, and a string (in that order):

my_list = [1, 5.7, True, "apples"]
len(my_list) == 4 # 4 objects in the list
my_list[0] == 1 # the first object
my_list[1] == 5.7 # the second object

In the preceding code, I used the len command to get the length of the list (which was four). Note the zero-indexing of Python: most computer languages start counting at zero instead of one.
So if I want the first element, I call the index zero, and if I want the 95th element, I call the index 94.

Example – parsing a single Tweet

Here is some more Python code. In this example, I will be parsing some tweets about stock prices:

tweet = "RT @j_o_n_dnger: $TWTR now top holding for Andor, unseating $AAPL"
words_in_tweet = tweet.split(' ') # list of words in tweet
for word in words_in_tweet: # for each word in list
    if "$" in word: # if word has a "cashtag"
        print "THIS TWEET IS ABOUT", word # alert the user

I will point out a few things about this code snippet, line by line, as follows:

We set a variable to hold some text (known as a string in Python). In this example, the tweet in question is "RT @j_o_n_dnger: $TWTR now top holding for Andor, unseating $AAPL".
The words_in_tweet variable "tokenizes" the tweet (separates it by word). If you were to print this variable, you would see the following: ['RT', '@j_o_n_dnger:', '$TWTR', 'now', 'top', 'holding', 'for', 'Andor,', 'unseating', '$AAPL']
We iterate through this list of words. This is called a for loop. It just means that we go through a list one by one.
Here, we have another if statement. For each word in this tweet, we check whether the word contains the $ character (this is how people reference stock tickers on Twitter).
If the preceding if statement is true (that is, if the word contains a cashtag), we print it and show it to the user.

The output of this code will be as follows: we get this output as these are the only words in the tweet that use the cashtag. Whenever I use Python in this article, I will ensure that I am as explicit as possible about what I am doing in each line of code.

Domain knowledge

As I mentioned earlier, this category focuses mainly on having knowledge about the particular topic you are working on. For example, if you are a financial analyst working on stock market data, you have a lot of domain knowledge. If you are a journalist looking at worldwide adoption rates, you might benefit from consulting an expert in the field. Does that mean that if you're not a doctor, you can't work with medical data? Of course not! Great data scientists can apply their skills to any area, even if they aren't fluent in it. Data scientists can adapt to the field and contribute meaningfully when their analysis is complete.

A big part of domain knowledge is presentation. Depending on your audience, it can greatly matter how you present your findings. Your results are only as good as your vehicle of communication. You can predict the movement of the market with 99.99% accuracy, but if your program is impossible to execute, your results will go unused. Likewise, if your vehicle is inappropriate for the field, your results will go equally unused.

Some more terminology

This is a good time to define some more vocabulary. By this point, you're probably excitedly looking up a lot of data science material and seeing words and phrases I haven't used yet. Here are some common terminologies you are likely to come across:

Machine learning: This refers to giving computers the ability to learn from data without explicit "rules" being given by a programmer. Machine learning combines the power of computers with intelligent learning algorithms in order to automate the discovery of relationships in data and the creation of powerful data models.
Speaking of data models, we will concern ourselves with the following two basic types of data models:

Probabilistic model: This refers to using probability to find a relationship between elements that includes a degree of randomness.
Statistical model: This refers to taking advantage of statistical theorems to formalize relationships between data elements in a (usually) simple mathematical formula.

While both statistical and probabilistic models can be run on computers and might be considered machine learning in that regard, we will keep these definitions separate, as machine learning algorithms generally attempt to learn relationships in different ways.

Exploratory data analysis: This refers to preparing data in order to standardize results and gain quick insights. Exploratory data analysis (EDA) is concerned with data visualization and preparation. This is where we turn unorganized data into organized data and also clean up missing/incorrect data points. During EDA, we will create many types of plots and use these plots to identify key features and relationships to exploit in our data models.

Data mining: This is the process of finding relationships between elements of data. Data mining is the part of data science where we try to find relationships between variables (think of the spawner-recruit model).

I tried pretty hard not to use the term big data up until now, because I think this term is misused, a lot. While the definition of the term varies from person to person, big data is data that is too large to be processed by a single machine (if your laptop crashed, it might be suffering from a case of big data).

The state of data science so far (this diagram is incomplete and is meant for visualization purposes only).

Summary

More and more people are jumping headfirst into the field of data science, most with no prior experience in math or CS, which on the surface is great. Average data scientists have access to millions of dating profiles' data, tweets, online reviews, and much more in order to jumpstart their education. However, if you jump into data science without the proper exposure to theory or coding practices, and without respect for the domain you are working in, you face the risk of oversimplifying the very phenomenon you are trying to model.

Resources for Article: Further resources on this subject: Reconstructing 3D Scenes [article] Basics of Classes and Objects [article] Saying Hello! [article]


Heart Diseases Prediction using Spark 2.0.0

Packt
18 Oct 2016
16 min read
In this article, Md. Rezaul Karim and Md. Mahedi Kaysar, the authors of the book Large Scale Machine Learning with Spark, discuss how to develop a large-scale heart disease prediction pipeline with Spark 2.0.0, covering steps such as taking input, parsing, creating labeled points for regression, model training, model saving, and finally predictive analytics using the trained model. We will develop a large-scale machine learning application using several classifiers, such as the random forest, decision tree, and linear regression classifiers. To make this happen, the following steps will be covered:

Data collection and exploration
Loading required packages and APIs
Creating an active Spark session
Data parsing and creation of an RDD of labeled points
Splitting the RDD of labeled points into training and test sets
Training the model
Saving the model for future use
Predictive analysis using the test set
Predictive analytics using a new dataset
Performance comparison among different classifiers

(For more resources related to this topic, see here.)

Background

Machine learning on big data is a radical combination that has created great impact in research, academia, and industry, as well as in the biomedical sector. In the area of biomedical data analytics, it promises better diagnosis and prognosis on real datasets, and therefore better healthcare. Moreover, life science research is also entering the big data era, since datasets are being generated and produced in an unprecedented way. This imposes great challenges on machine learning and bioinformatics tools and algorithms to find the VALUE in the big data criteria of volume, velocity, variety, veracity, visibility, and value. In this article, we will show how to predict the possibility of future heart disease by using Spark machine learning APIs, including Spark MLlib, Spark ML, and Spark SQL.

Data collection and exploration

In recent times, biomedical research has advanced greatly, and more and more life sciences datasets are being generated, many of them openly available. However, for simplicity and ease, we decided to use the Cleveland database, because to date most of the researchers who have applied machine learning techniques to biomedical data analytics have used this dataset. According to the dataset description at https://archive.ics.uci.edu/ml/machine-learning-databases/heart-disease/heart-disease.names, the heart disease dataset is one of the most used and best-studied datasets in biomedical data analytics and machine learning. The dataset is freely available at the UCI machine learning dataset repository at https://archive.ics.uci.edu/ml/machine-learning-databases/heart-disease/.

This data contains a total of 76 attributes; however, most of the published research papers use a subset of only 14 fields. The goal field indicates whether heart disease is present or absent. It has 5 possible values ranging from 0 to 4: the value 0 signifies no presence of heart disease, the values 1 and 2 signify that the disease is present but in an early stage, and the values 3 and 4 indicate a strong likelihood of heart disease. Biomedical laboratory experiments with the Cleveland dataset have concentrated on simply attempting to distinguish presence (values 1, 2, 3, 4) from absence (value 0). In short, the higher the value, the stronger the evidence of the presence of the disease.
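Before building the Spark pipeline, it can be worth taking a quick look at the distribution of that goal field. The following is a small pandas sketch, not part of the book's Java code; it assumes the processed Cleveland file has been downloaded to the path used later in this article, and it names the 14 columns after the attribute list shown in the next section.

```python
import pandas as pd

# The processed Cleveland file has no header row; the last of the 14 columns
# is the diagnosis (0 = absence, 1-4 = increasing evidence of disease).
columns = ["age", "sex", "cp", "trestbps", "chol", "fbs", "restecg",
           "thalach", "exang", "oldpeak", "slope", "ca", "thal", "num"]

df = pd.read_csv("heart_diseases/processed_cleveland.data",
                 names=columns, na_values="?")

# Count how many records fall into each diagnosis value (0 through 4).
print(df["num"].value_counts().sort_index())
```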
Privacy is also an important concern in biomedical data analytics, as it is in all kinds of diagnosis and prognosis. For that reason, the names and social security numbers of the patients were removed from the dataset and replaced with dummy values. It is to be noted that three processed files are provided, containing the Cleveland, Hungarian, and Switzerland datasets; all four unprocessed files also exist in the same directory. To demonstrate the example, we will use the Cleveland dataset for training and evaluating the models, while the Hungarian dataset will be used to re-use the saved model.

As already said, although the number of attributes is 76 (including the predicted attribute), like other ML/biomedical researchers we will use only 14 attributes, with the following attribute information:

1. age: Age in years
2. sex: Sex (1 = male; 0 = female)
3. cp: Chest pain type (Value 1: typical angina; Value 2: atypical angina; Value 3: non-anginal pain; Value 4: asymptomatic)
4. trestbps: Resting blood pressure (in mm Hg on admission to the hospital)
5. chol: Serum cholesterol in mg/dl
6. fbs: Fasting blood sugar > 120 mg/dl (1 = true; 0 = false)
7. restecg: Resting electrocardiographic results (Value 0: normal; Value 1: ST-T wave abnormality; Value 2: probable or definite left ventricular hypertrophy by Estes' criteria)
8. thalach: Maximum heart rate achieved
9. exang: Exercise-induced angina (1 = yes; 0 = no)
10. oldpeak: ST depression induced by exercise relative to rest
11. slope: The slope of the peak exercise ST segment (Value 1: upsloping; Value 2: flat; Value 3: downsloping)
12. ca: Number of major vessels (0-3) colored by fluoroscopy
13. thal: Heart rate (Value 3 = normal; Value 6 = fixed defect; Value 7 = reversible defect)
14. num: Diagnosis of heart disease, angiographic disease status (Value 0: < 50% diameter narrowing; Value 1: > 50% diameter narrowing)

Table 1: Dataset characteristics

Note that there are several missing attribute values, distinguished with the value -9.0. The Cleveland dataset contains the following class distribution:

Database: 0 / 1 / 2 / 3 / 4 / Total
Cleveland: 164 / 55 / 36 / 35 / 13 / 303

A sample snapshot of the dataset is given as follows:

Figure 1: Snapshot of the Cleveland heart disease dataset

Loading required packages and APIs

The following packages and APIs need to be imported for our purpose.
We believe the packages are self-explanatory if you have a minimum of working experience with Spark 2.0.0:

import java.util.HashMap;
import java.util.List;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.function.Function;
import org.apache.spark.api.java.function.PairFunction;
import org.apache.spark.ml.classification.LogisticRegression;
import org.apache.spark.mllib.classification.LogisticRegressionModel;
import org.apache.spark.mllib.classification.NaiveBayes;
import org.apache.spark.mllib.classification.NaiveBayesModel;
import org.apache.spark.mllib.linalg.DenseVector;
import org.apache.spark.mllib.linalg.Vector;
import org.apache.spark.mllib.regression.LabeledPoint;
import org.apache.spark.mllib.regression.LinearRegressionModel;
import org.apache.spark.mllib.regression.LinearRegressionWithSGD;
import org.apache.spark.mllib.tree.DecisionTree;
import org.apache.spark.mllib.tree.RandomForest;
import org.apache.spark.mllib.tree.model.DecisionTreeModel;
import org.apache.spark.mllib.tree.model.RandomForestModel;
import org.apache.spark.rdd.RDD;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import com.example.SparkSession.UtilityForSparkSession;
import javassist.bytecode.Descriptor.Iterator;
import scala.Tuple2;

Creating an active Spark session

SparkSession spark = UtilityForSparkSession.mySession();

Here is the UtilityForSparkSession class that creates and returns an active Spark session:

import org.apache.spark.sql.SparkSession;
public class UtilityForSparkSession {
  public static SparkSession mySession() {
    SparkSession spark = SparkSession
      .builder()
      .appName("UtilityForSparkSession")
      .master("local[*]")
      .config("spark.sql.warehouse.dir", "E:/Exp/")
      .getOrCreate();
    return spark;
  }
}

Note that here, on the Windows 7 platform, we have set the Spark SQL warehouse to "E:/Exp/"; set your path accordingly based on your operating system.

Data parsing and RDD of labeled point creation

We take the input as a simple text file, parse it, and create an RDD of labeled points that will be used for the classification and regression analysis. We also specify the input source and the number of partitions; adjust the number of partitions based on your dataset size. Here the number of partitions has been set to 2:

String input = "heart_diseases/processed_cleveland.data";
Dataset<Row> my_data = spark.read().format("com.databricks.spark.csv").load(input);
my_data.show(false);
RDD<String> linesRDD = spark.sparkContext().textFile(input, 2);

Since a JavaRDD cannot be created directly from the text file, we have created a simple RDD so that we can convert it to a JavaRDD when necessary. Now let's create the JavaRDD of labeled points.
To serve our purpose, we convert the RDD to a JavaRDD as follows:

JavaRDD<LabeledPoint> data = linesRDD.toJavaRDD().map(new Function<String, LabeledPoint>() {
  @Override
  public LabeledPoint call(String row) throws Exception {
    String line = row.replaceAll("\\?", "999999.0");
    String[] tokens = line.split(",");
    Integer last = Integer.parseInt(tokens[13]);
    double[] features = new double[13];
    for (int i = 0; i < 13; i++) {
      features[i] = Double.parseDouble(tokens[i]);
    }
    Vector v = new DenseVector(features);
    Double value = 0.0;
    if (last.intValue() > 0)
      value = 1.0;
    LabeledPoint lp = new LabeledPoint(value, v);
    return lp;
  }
});

Using the replaceAll() method, we have handled invalid values such as the missing values marked in the original file with the ? character. To get rid of the missing or invalid values, we have replaced them with a very large value that has no side effect on the original classification or predictive results. The reason behind this is that missing or sparse data can lead to highly misleading results.

Splitting the RDD of labeled points into training and test sets

In the previous step, we created the RDD of labeled points that can be used for the regression or classification task. Now we need to split the data into a training set and a test set, as follows:

double[] weights = {0.7, 0.3};
long split_seed = 12345L;
JavaRDD<LabeledPoint>[] split = data.randomSplit(weights, split_seed);
JavaRDD<LabeledPoint> training = split[0];
JavaRDD<LabeledPoint> test = split[1];

If you look at the preceding code segment, you will find that we have split the RDD of labeled points so that 70% goes to the training set and 30% goes to the test set. The randomSplit() method does this split. Note that you can set this RDD's storage level to persist its values across operations after the first time it is computed; a new storage level can only be assigned if the RDD does not have a storage level set yet. The split seed is a long integer that makes the split random while ensuring the result does not change between runs or iterations during model building or training.

Training the model and predicting the possibility of heart disease

First, we will train the linear regression model, which is the simplest regression classifier:

final double stepSize = 0.0000000009;
final int numberOfIterations = 40;
LinearRegressionModel model = LinearRegressionWithSGD.train(JavaRDD.toRDD(training), numberOfIterations, stepSize);

As you can see, the preceding code trains a linear regression model with no regularization using stochastic gradient descent. This solves the least squares regression formulation f(weights) = 1/n ||A weights - y||^2, which is the mean squared error. Here the data matrix has n rows, and the input RDD holds the set of rows of A, each with its corresponding right-hand side label y. To train the model, the method takes the training set, the number of iterations, and the step size; we provide somewhat arbitrary values for the last two parameters here.

Saving the model for future use

Now let's save the model that we just created for future use.
It's pretty simple; just specify the storage location and use the following code:

String model_storage_loc = "models/heartModel";
model.save(spark.sparkContext(), model_storage_loc);

Once the model is saved in your desired location, you will see the following output in your Eclipse console:

Figure 2: The log after the model is saved to storage

Predictive analysis using the test set

Now let's calculate the prediction score on the test dataset:

JavaPairRDD<Double, Double> predictionAndLabel = test.mapToPair(new PairFunction<LabeledPoint, Double, Double>() {
  @Override
  public Tuple2<Double, Double> call(LabeledPoint p) {
    return new Tuple2<>(model.predict(p.features()), p.label());
  }
});

Then compute the accuracy of the prediction:

double accuracy = predictionAndLabel.filter(new Function<Tuple2<Double, Double>, Boolean>() {
  @Override
  public Boolean call(Tuple2<Double, Double> pl) {
    return pl._1().equals(pl._2());
  }
}).count() / (double) test.count();
System.out.println("Accuracy of the classification: "+accuracy);

The output goes as follows:

Accuracy of the classification: 0.0

Performance comparison among different classifiers

Unfortunately, there is no prediction accuracy at all, right? There might be several reasons for that, including:

The dataset characteristics
Model selection
Parameter selection, also called hyperparameter tuning

(It is also worth keeping in mind that a linear regression model outputs continuous values, so comparing its raw predictions against the 0/1 labels with an exact equality test will almost never produce a match, which by itself drives the measured accuracy toward zero.)

For simplicity, we assume the dataset is okay, since, as already said, it is widely used for machine learning research by researchers around the globe. Now, what next? Let's consider another classifier algorithm, for example the random forest or decision tree classifier. What about the random forest? Let's go with the random forest classifier in second place. Just use the code below to train the model using the training set:

Integer numClasses = 26; // Number of classes
// An empty categoricalFeaturesInfo map indicates all features are treated as continuous
HashMap<Integer, Integer> categoricalFeaturesInfo = new HashMap<Integer, Integer>();
Integer numTrees = 5; // Use more in practice
String featureSubsetStrategy = "auto"; // Let the algorithm choose the best
String impurity = "gini"; // information gain and variance reduction are also available
Integer maxDepth = 20; // set the value of maximum depth accordingly
Integer maxBins = 40; // set the value of bins accordingly
Integer seed = 12345; // Setting a long seed value is recommended
final RandomForestModel model = RandomForest.trainClassifier(training, numClasses, categoricalFeaturesInfo, numTrees, featureSubsetStrategy, impurity, maxDepth, maxBins, seed);

We believe the parameters used by the trainClassifier() method are self-explanatory, and we leave it to the reader to get to know the significance of each parameter. Fantastic! We have trained the model using the random forest classifier and managed to save the model, too, for future use. Now, if you reuse the same code that we described in the Predictive analysis using the test set step, you should get the following output:

Accuracy of the classification: 0.7843137254901961

Much better, right? If you are still not satisfied, you can try another classifier model, such as the Naïve Bayes classifier.

Predictive analytics using the new dataset

As we already mentioned, we have saved the model for future use, so now we should take the opportunity to use the same model for new datasets. The reason is that, if you recall the steps, we trained the model using the training set and evaluated it using the test set.
Now, what if you have more data or new data available to be used? Will you go for re-training the model? Of course not, since you would have to iterate through several steps and sacrifice valuable time and cost too. Therefore, it would be wise to use the already trained model and predict the performance on a new dataset. Well, now let's reuse the stored model. Note that you will have to load the saved model with the same kind of model that was trained. For example, if you did the model training using the random forest classifier and saved the model, you will have to use the same classifier model to load it. Therefore, we will use the random forest to load the model while using the new dataset. Use just the following code for doing that. First, create an RDD of labeled points from the new dataset (that is, the Hungarian database with the same 14 attributes):

String new_data = "heart_diseases/processed_hungarian.data";
RDD<String> linesRDD = spark.sparkContext().textFile(new_data, 2);
JavaRDD<LabeledPoint> data = linesRDD.toJavaRDD().map(new Function<String, LabeledPoint>() {
  @Override
  public LabeledPoint call(String row) throws Exception {
    String line = row.replaceAll("\\?", "999999.0");
    String[] tokens = line.split(",");
    Integer last = Integer.parseInt(tokens[13]);
    double[] features = new double[13];
    for (int i = 0; i < 13; i++) {
      features[i] = Double.parseDouble(tokens[i]);
    }
    Vector v = new DenseVector(features);
    Double value = 0.0;
    if (last.intValue() > 0)
      value = 1.0;
    LabeledPoint p = new LabeledPoint(value, v);
    return p;
  }
});

Now let's load the saved model using the random forest model algorithm as follows:

RandomForestModel model2 = RandomForestModel.load(spark.sparkContext(), model_storage_loc);

Now let's calculate the predictions on this dataset:

JavaPairRDD<Double, Double> predictionAndLabel = data.mapToPair(new PairFunction<LabeledPoint, Double, Double>() {
  @Override
  public Tuple2<Double, Double> call(LabeledPoint p) {
    return new Tuple2<>(model2.predict(p.features()), p.label());
  }
});

Now calculate the accuracy of the prediction as follows:

double accuracy = predictionAndLabel.filter(new Function<Tuple2<Double, Double>, Boolean>() {
  @Override
  public Boolean call(Tuple2<Double, Double> pl) {
    return pl._1().equals(pl._2());
  }
}).count() / (double) data.count();
System.out.println("Accuracy of the classification: "+accuracy);

We got the following output:

Accuracy of the classification: 0.7380952380952381

For more interesting machine learning applications, such as spam filtering, topic modelling for real-time streaming data, handling graph data for machine learning, market basket analysis, neighbourhood clustering analysis, air flight delay analysis, making ML applications adaptable, model saving and reusing, hyperparameter tuning and model selection, breast cancer diagnosis and prognosis, heart disease prediction, optical character recognition, hypothesis testing, dimensionality reduction for high-dimensional data, large-scale text manipulation, and many more, see the book. Moreover, the book also covers how to scale up ML models to handle massive datasets on cloud computing infrastructure. Furthermore, some best practices in machine learning techniques are also discussed.
In a nutshell, many useful and exciting applications have been developed using the following machine learning algorithms:

Linear Support Vector Machine (SVM)
Linear Regression
Logistic Regression
Decision Tree classifier
Random Forest classifier
K-means clustering
LDA topic modelling from static and real-time streaming data
Naïve Bayes classifier
Multilayer Perceptron classifier for deep classification
Singular Value Decomposition (SVD) for dimensionality reduction
Principal Component Analysis (PCA) for dimensionality reduction
Generalized Linear Regression
Chi-square test (for goodness-of-fit tests, independence tests, and feature tests)
Kolmogorov-Smirnov test for hypothesis testing
Spark Core for market basket analysis
Multi-label classification
One-vs-Rest classifier
Gradient Boosting classifier
ALS algorithm for movie recommendation
Cross-validation for model selection
Train split for model selection
RegexTokenizer, StringIndexer, StopWordsRemover, HashingTF, and TF-IDF for text manipulation

Summary

In this article we saw how beneficial large-scale machine learning with Spark can be in virtually any field.

Resources for Article: Further resources on this subject: Spark for Beginners [article] Setting up Spark [article] Holistic View on Spark [article]


Diving into Data – Search and Report

Packt
17 Oct 2016
11 min read
In this article by Josh Diakun, Paul R Johnson, and Derek Mock authors of the books Splunk Operational Intelligence Cookbook - Second Edition, we will cover the basic ways to search the data in Splunk. We will cover how to make raw event data readable (For more resources related to this topic, see here.) The ability to search machine data is one of Splunk's core functions, and it should come as no surprise that many other features and functions of Splunk are heavily driven-off searches. Everything from basic reports and dashboards to data models and fully featured Splunk applications are powered by Splunk searches behind the scenes. Splunk has its own search language known as the Search Processing Language (SPL). This SPL contains hundreds of search commands, most of which also have several functions, arguments, and clauses. While a basic understanding of SPL is required in order to effectively search your data in Splunk, you are not expected to know all the commands! Even the most seasoned ninjas do not know all the commands and regularly refer to the Splunk manuals, website, or Splunk Answers (http://answers.splunk.com). To get you on your way with SPL, be sure to check out the search command cheat sheet and download the handy quick reference guide available at http://docs.splunk.com/Documentation/Splunk/latest/SearchReference/SplunkEnterpriseQuickReferenceGuide. Searching Searches in Splunk usually start with a base search, followed by a number of commands that are delimited by one or more pipe (|) characters. The result of a command or search to the left of the pipe is used as the input for the next command to the right of the pipe. Multiple pipes are often found in a Splunk search to continually refine data results as needed. As we go through this article, this concept will become very familiar to you. Splunk allows you to search for anything that might be found in your log data. For example, the most basic search in Splunk might be a search for a keyword such as error or an IP address such as 10.10.12.150. However, searching for a single word or IP over the terabytes of data that might potentially be in Splunk is not very efficient. Therefore, we can use the SPL and a number of Splunk commands to really refine our searches. The more refined and granular the search, the faster the time to run and the quicker you get to the data you are looking for! When searching in Splunk, try to filter as much as possible before the first pipe (|) character, as this will save CPU and disk I/O. Also, pick your time range wisely. Often, it helps to run the search over a small time range when testing it and then extend the range once the search provides what you need. Boolean operators There are three different types of Boolean operators available in Splunk. These are AND, OR, and NOT. Case sensitivity is important here, and these operators must be in uppercase to be recognized by Splunk. The AND operator is implied by default and is not needed, but does no harm if used. For example, searching for the term error or success would return all the events that contain either the word error or the word success. Searching for error success would return all the events that contain the words error and success. Another way to write this can be error AND success. Searching web access logs for error OR success NOT mozilla would return all the events that contain either the word error or success, but not those events that also contain the word mozilla. 
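As a rough analogy (and only an analogy; it is not how Splunk evaluates searches internally), the error OR success NOT mozilla example corresponds to the following kind of filtering logic, sketched here in Python over a made-up list of event strings:

```python
# Made-up raw events, purely for illustration.
events = [
    "2016-10-17 10:01:02 GET /index.html 200 success mozilla/5.0",
    "2016-10-17 10:01:03 GET /login 500 error curl/7.43",
    "2016-10-17 10:01:04 GET /cart 200 success curl/7.43",
    "2016-10-17 10:01:05 GET /health 200 ok mozilla/5.0",
]

# Equivalent of: error OR success NOT mozilla
matches = [
    event for event in events
    if ("error" in event or "success" in event) and "mozilla" not in event
]

for event in matches:
    print(event)
# Keeps the curl events containing "error" or "success";
# the mozilla events are excluded by the NOT clause.
```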
Common commands
There are many commands in Splunk that you will likely use on a daily basis when searching data within Splunk. These common commands are outlined in the following list:

chart/timechart: This command outputs results in a tabular and/or time-based output for use by Splunk charts.
dedup: This command de-duplicates results based upon specified fields, keeping the most recent match.
eval: This command evaluates new or existing fields and values. There are many different functions available for eval.
fields: This command specifies the fields to keep or remove in search results.
head: This command keeps the first X (as specified) rows of results.
lookup: This command looks up fields against an external source or list, to return additional field values.
rare: This command identifies the least common values of a field.
rename: This command renames the fields.
replace: This command replaces the values of fields with another value.
search: This command permits subsequent searching and filtering of results.
sort: This command sorts results in either ascending or descending order.
stats: This command performs statistical operations on the results. There are many different functions available for stats.
table: This command formats the results into a tabular output.
tail: This command keeps only the last X (as specified) rows of results.
top: This command identifies the most common values of a field.
transaction: This command merges events into a single event based upon a common transaction identifier.

Time modifiers
The drop-down time range picker in the Graphical User Interface (GUI) to the right of the Splunk search bar allows users to select from a number of different preset and custom time ranges. However, in addition to using the GUI, you can also specify time ranges directly in your search string using the earliest and latest time modifiers. When a time modifier is used in this way, it automatically overrides any time range that might be set in the GUI time range picker. The earliest and latest time modifiers can accept a number of different time units: seconds (s), minutes (m), hours (h), days (d), weeks (w), months (mon), quarters (q), and years (y). Time modifiers can also make use of the @ symbol to round down and snap to a specified time. For example, searching for sourcetype=access_combined earliest=-1d@d latest=-1h will search all the access_combined events from midnight, a day ago until an hour ago from now. Note that the snap (@) will round down such that if it were 12 p.m. now, we would be searching from midnight a day and a half ago until 11 a.m. today.

Working with fields
Fields in Splunk can be thought of as keywords that have one or more values. These fields are fully searchable by Splunk. At a minimum, every data source that comes into Splunk will have the source, host, index, and sourcetype fields, but some sources might have hundreds of additional fields. If the raw log data contains key-value pairs or is in a structured format such as JSON or XML, then Splunk will automatically extract the fields and make them searchable. Splunk can also be told how to extract fields from the raw log data in the backend props.conf and transforms.conf configuration files. Searching for specific field values is simple. For example, sourcetype=access_combined status!=200 will search for events with a sourcetype field value of access_combined that have a status field with a value other than 200.
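To tie the common commands, time modifiers, and field searches together, here is a hedged example; it reuses the access_combined sourcetype and the status and uri_path fields referenced in this article, so the exact names may differ in your own environment:

sourcetype=access_combined earliest=-24h@h latest=now status!=200 | top limit=5 uri_path | fields - percent

This search looks back over the last 24 complete hours for events whose status field is not 200, uses top (from the preceding list of common commands) to find the five most common uri_path values, and then uses fields to drop the percent column that top adds by default.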
Splunk has a number of built-in pre-trained sourcetypes that ship with Splunk Enterprise and work out of the box with common data sources. These are available at http://docs.splunk.com/Documentation/Splunk/latest/Data/Listofpretrainedsourcetypes. In addition, Technical Add-Ons (TAs), which contain event types and field extractions for many other common data sources such as Windows events, are available from the Splunk app store at https://splunkbase.splunk.com.

Saving searches
Once you have written a nice search in Splunk, you may wish to save the search so that you can use it again at a later date or use it for a dashboard. Saved searches in Splunk are known as Reports. To save a search in Splunk, you simply click on the Save As button on the top right-hand side of the main search bar and select Report.

Making raw event data readable
When a basic search is executed in Splunk from the search bar, the search results are displayed in a raw event format by default. To many users, this raw event information is not particularly readable, and valuable information is often clouded by other less valuable data within the event. Additionally, if the events span several lines, only a few events can be seen on the screen at any one time. In this recipe, we will write a Splunk search to demonstrate how we can leverage Splunk commands to make raw event data readable, tabulating events and displaying only the fields we are interested in.

Getting ready
You should be familiar with the Splunk search bar and search results area.

How to do it…
Follow the given steps to search and tabulate the selected event data:
1. Log in to your Splunk server.
2. Select the Search & Reporting application from the drop-down menu located in the top left-hand side of the screen.
3. Set the time range picker to Last 24 hours and type the following search into the Splunk search bar: index=main sourcetype=access_combined
4. Then, click on Search or hit Enter. Splunk will return the results of the search and display the raw search events under the search bar.
5. Let's rerun the search, but this time we will add the table command as follows: index=main sourcetype=access_combined | table _time, referer_domain, method, uri_path, status, JSESSIONID, useragent
6. Splunk will now return the same number of events, but instead of presenting the raw events to you, the data will be in a nicely formatted table, displaying only the fields we specified. This is much easier to read!
7. Save this search by clicking on Save As and then on Report. Give the report the name cp02_tabulated_webaccess_logs and click on Save.
8. On the next screen, click on Continue Editing to return to the search.

How it works…
Let's break down the search piece by piece:

index=main: All the data in Splunk is held in one or more indexes. While not strictly necessary, it is a good practice to specify the index(es) to search, as this will ensure a more precise search.
sourcetype=access_combined: This tells Splunk to search only the data associated with the access_combined sourcetype, which, in our case, is the web access logs.
| table _time, referer_domain, method, uri_path, status, JSESSIONID, useragent: Using the table command, we take the result of our search to the left of the pipe and tell Splunk to return the data in a tabular format. Splunk will only display the fields specified after the table command in the table of results.

In this recipe, you used the table command. The table command can have a noticeable performance impact on large searches.
It should be used towards the end of a search, once all the other processing on the data by the other Splunk commands has been performed. The stats command is more efficient than the table command and should be used in place of table where possible. However, be aware that stats and table are two very different commands.

There's more…
The table command is very useful in situations where we wish to present data in a readable format. Additionally, tabulated data in Splunk can be downloaded as a CSV file, which many users find useful for offline processing in spreadsheet software or for sending to others. There are some other ways we can leverage the table command to make our raw event data readable.

Tabulating every field
Often, there are situations where we want to present every event within the data in a tabular format, without having to specify each field one by one. To do this, we simply use a wildcard (*) character as follows: index=main sourcetype=access_combined | table *

Removing fields, then tabulating everything else
While tabulating every field using the wildcard (*) character is useful, you will notice that there are a number of Splunk internal fields, such as _raw, that appear in the table. We can use the fields command before the table command to remove the fields as follows: index=main sourcetype=access_combined | fields - sourcetype, index, _raw, source, date*, linecount, punct, host, time*, eventtype | table * If we do not include the minus (-) character after the fields command, Splunk will keep the specified fields and remove all the other fields.

Summary
In this article, along with an introduction to searching in Splunk, we covered how to make raw event data readable.

Resources for Article: Further resources on this subject: Splunk's Input Methods and Data Feeds [Article] The Splunk Interface [Article] The Splunk Web Framework [Article]

IoT and Decision Science

Packt
13 Oct 2016
10 min read
In this article by Jojo Moolayil, author of the book Smarter Decisions - The Intersection of Internet of Things and Decision Science, you will learn that the Internet of Things (IoT) and Decision Science have been among the hottest topics in the industry for a while now. You would have heard about IoT and wanted to learn more about it, but unfortunately you would have come across multiple names and definitions over the Internet with hazy differences between them. Also, Decision Science has grown from a nascent domain to become one of the fastest-growing and most widespread horizontals in the industry in recent years. With the ever-increasing volume, variety, and veracity of data, decision science has become more and more valuable for the industry. Using data to uncover latent patterns and insights to solve business problems has made it easier for businesses to take actions with better impact and accuracy. (For more resources related to this topic, see here.) Data is the new oil for the industry, and with the boom of IoT, we are in a world where more and more devices are getting connected to the Internet, with sensors capturing more and more vital and granular details that had never been captured earlier. The IoT is a game changer: with a plethora of devices connected to each other, the industry is eagerly attempting to tap the huge potential that it can deliver. The true value and impact of IoT is delivered with the help of Decision Science. IoT has inherently generated an ocean of data where you can swim to gather insights and take smarter decisions with the intersection of Decision Science and IoT. In this book, you will learn about IoT and Decision Science in detail by solving real-life IoT business problems using a structured approach. In this article, we will begin by understanding the fundamental basics of IoT and Decision Science problem solving. You will learn the following concepts:
Understanding IoT and demystifying Machine to Machine (M2M), IoT, Internet of Everything (IoE), and Industrial IoT (IIoT)
Digging deeper into the logical stack of IoT
Studying the problem life cycle
Exploring the problem landscape
The art of problem solving
The problem solving framework
It is highly recommended that you explore this article in depth. It focuses on the basics and concepts required to build problems and use cases.

Understanding the IoT
To get started with the IoT, let's first try to understand it using the easiest constructs: Internet and Things. We have two simple words here that help us understand the entire concept. So what is the Internet? It is basically a network of computing devices. Similarly, what is a Thing? It could be any real-life entity featuring Internet connectivity. So now, what do we decipher from IoT? It is a network of connected Things that can transmit and receive data from other things once connected to the network. This is how we describe the Internet of Things in a nutshell. Now, let's take a glance at the definition. IoT can be defined as the ever-growing network of Things (entities) that feature Internet connectivity and the communication that occurs between them and other Internet-enabled devices and systems. The Things in IoT are enabled with sensors that capture vital information from the device during its operations, and the device features Internet connectivity that helps it transfer and communicate to other devices and the network.
Today, when we discuss IoT, there are so many other similar terms that come into the picture, such as Industrial Internet, M2M, IoE, and a few more, and we find it difficult to understand the differences between them. Before we begin delineating the differences between these hazy terms and understand how IoT evolved in the industry, let's first take a simple real-life scenario to understand what IoT actually looks like.

IoT in a real-life scenario
Let's take a simple example to understand how IoT works. Consider a scenario where you are a father in a family with a working mother and a 10-year-old son studying in school. You and your wife work in different offices. Your house is equipped with quite a few smart devices, say, a smart microwave, smart refrigerator, and smart TV. You are currently in the office and you get notified on your smartphone that your son, Josh, has reached home from school. (He used his personal smart key to open the door.) You then use your smartphone to turn on the microwave at home to heat the sandwiches kept in it. Your son gets notified on the smart home controller that you have hot sandwiches ready for him. He quickly finishes them and starts preparing for a math test at school, and you resume your work. After a while, you get notified again that your wife has also reached home (she also uses a similar smart key) and you suddenly realize that you need to reach home to help your son with his math test. You again use your smartphone and change the air conditioner settings for three people and set the refrigerator to defrost using the app. In another 15 minutes, you are home and the air conditioning temperature is well set for three people. You then grab a can of juice from the refrigerator and discuss some math problems with your son on the couch. Intuitive, isn't it? How did this happen, and how did you access and control everything right from your phone? Well, this is how IoT works! Devices can talk to each other and also take actions based on the signals received: The IoT scenario

Let's take a closer look at the same scenario. You are sitting in the office and you could access the air conditioner, microwave, refrigerator, and home controller through your smartphone. Yes, the devices feature Internet connectivity and, once connected to the network, they can send and receive data from other devices and take actions based on signals. A simple protocol helps these devices understand and send data and signals to a plethora of heterogeneous devices connected to the network. We will get into the details of the protocol and how these devices talk to each other soon. However, before that, we will get into some details of how this technology started and why we have so many different names today for IoT.

Demystifying M2M, IoT, IIoT, and IoE
So now that we have a general understanding of what IoT is, let's try to understand how it all started. A few questions that we will try to answer are: Is IoT very new to the market? When did this start? How did this start? What's the difference between M2M, IoT, IoE, and all those different names? And so on. If we try to understand the fundamentals of IoT, that is, machines or devices connected to each other in a network, which isn't something really new or radically challenging, then what is this buzz all about? The buzz about machines talking to each other started long before most of us thought of it, and back then it was called Machine to Machine Data.
In the early 1950s, a lot of the machinery deployed for aerospace and military operations required automated communication and remote access for service and maintenance. Telemetry was where it all started. It is a process in which highly automated communication is established so that data can be collected by making measurements at remote or inaccessible geographical locations and then sent to a receiver through a cellular or wired network, where it is monitored for further action. To understand this better, let's take the example of a manned space shuttle sent for space exploration. A huge number of sensors are installed in such a space shuttle to monitor the physical condition of the astronauts, the environment, and also the condition of the space shuttle itself. The data collected through these sensors is then sent back to the substation located on Earth, where a team uses this data to analyze and take further actions. During the same time, the industrial revolution peaked and a huge number of machines were deployed in various industries. Some of these industries, where failures could be catastrophic, also saw a rise in machine-to-machine communication and remote monitoring: Telemetry

Thus, machine-to-machine data, a.k.a. M2M, was born, mainly through telemetry. Unfortunately, it didn't scale to the extent that it was supposed to, and this was largely because of the time it was developed in. Back then, cellular connectivity was not widespread and affordable, and installing sensors and developing the infrastructure to gather data from them was a very expensive deal. Therefore, only a small chunk of business and military use cases leveraged this. As time passed, a lot of changes happened. The Internet was born and flourished exponentially. The number of devices that got connected to the Internet was colossal. Computing power, storage capacities, and communication and technology infrastructure scaled massively. Additionally, the need to connect devices to other devices evolved, and the cost of setting up infrastructure for this became very affordable and agile. Thus came the IoT. The major difference between M2M and IoT initially was that the latter used the Internet (IPv4/6) as the medium whereas the former used a cellular or wired connection for communication. However, this was mainly because of the time they evolved in. Today, heavy engineering industries have deployed machinery that communicates over the IPv4/6 network, and this is called Industrial IoT or sometimes M2M. The difference between the two is minimal and there are enough cases where both are used interchangeably. Therefore, even though M2M was actually the ancestor of IoT, today both are pretty much the same. M2M and IIoT are nowadays aggressively used to market IoT disruptions in the industrial sector. IoE or Internet of Everything is a term that surfaced in the media and on the Internet very recently. The term was coined by Cisco with a very intuitive definition. It emphasizes Humans as one dimension in the ecosystem. It is a more organized way of defining IoT. The IoE has logically broken down the IoT ecosystem into smaller components and simplified the ecosystem in an innovative way that was very much essential. IoE divides its ecosystem into four logical units as follows:
People
Processes
Data
Devices
Built on the foundation of IoT, IoE is defined as the networked connection of People, Data, Processes, and Things.
Overall, all these different terms in the IoT fraternity have more similarities than differences and, at their core, they are the same, that is, devices connecting to each other over a network. The names are then stylized to give a more intrinsic connotation of the business they refer to, such as Industrial IoT and Machine to Machine for B2B heavy engineering, manufacturing, and energy verticals, Consumer IoT for B2C industries, and so on.

Summary
In this article, we learnt what the IoT is: a network of connected Things, where a Thing could be any real-life entity featuring Internet connectivity, that can transmit and receive data from other things once connected to the network. We walked through a real-life smart home scenario and then demystified M2M, IIoT, and IoE, seeing how they evolved from telemetry and how they relate to IoT today.

Resources for Article: Further resources on this subject: Machine Learning Tasks [article] Welcome to Machine Learning Using the .NET Framework [article] Why Big Data in the Financial Sector? [article]

Reconstructing 3D Scenes

Packt
13 Oct 2016
25 min read
In this article, Robert Laganiere, the author of the book OpenCV 3 Computer Vision Application Programming Cookbook, Third Edition, covers the following recipes:
Calibrating a camera
Recovering camera pose
(For more resources related to this topic, see here.)

Digital image formation
Let us now redraw a new version of the figure describing the pin-hole camera model. More specifically, we want to demonstrate the relation between a point in 3D at position (X,Y,Z) and its image (x,y), on a camera specified in pixel coordinates. Note the changes that have been made to the original figure. First, a reference frame was positioned at the center of the projection; then, the Y-axis was aligned to point downwards to get a coordinate system that is compatible with the usual convention, which places the image origin at the upper-left corner of the image. Finally, we have identified a special point on the image plane, by considering the line coming from the focal point that is orthogonal to the image plane. The point (u0,v0) is the pixel position at which this line pierces the image plane and is called the principal point. It would be logical to assume that this principal point is at the center of the image plane, but in practice, this one might be off by a few pixels depending on the precision of the camera. Since we are dealing with digital images, the number of pixels on the image plane (its resolution) is another important characteristic of a camera. We learned previously that a 3D point (X,Y,Z) will be projected onto the image plane at (fX/Z,fY/Z). Now, if we want to translate this coordinate into pixels, we need to divide the 2D image position by the pixel width (px) and then the height (py). Note that by dividing the focal length given in world units (generally given in millimeters) by px, we obtain the focal length expressed in (horizontal) pixels. We will then define this term as fx. Similarly, fy = f/py is defined as the focal length expressed in vertical pixel units. The complete projective equation is therefore as shown: x = fx X/Z + u0 and y = fy Y/Z + v0. We know that (u0,v0) is the principal point that is added to the result in order to move the origin to the upper-left corner of the image. Also, the physical size of a pixel can be obtained by dividing the size of the image sensor (generally in millimeters) by the number of pixels (horizontally or vertically). In modern sensors, pixels are generally of square shape, that is, they have the same horizontal and vertical size. The preceding equations can be rewritten in matrix form. Here is the complete projective equation in its most general form, with s an arbitrary scale factor, the first 3x3 matrix holding the intrinsic parameters, and the second matrix holding the rotation entries r1 to r9 and the translation t1, t2, t3 that bring world coordinates into the camera reference frame: s [x, y, 1]T = [fx, 0, u0; 0, fy, v0; 0, 0, 1] [r1, r2, r3, t1; r4, r5, r6, t2; r7, r8, r9, t3] [X, Y, Z, 1]T.

Calibrating a camera
Camera calibration is the process by which the different camera parameters (that is, the ones appearing in the projective equation) are obtained. One can obviously use the specifications provided by the camera manufacturer, but for some tasks, such as 3D reconstruction, these specifications are not accurate enough. However, accurate calibration information can be obtained by undertaking an appropriate camera calibration step. An active camera calibration procedure will proceed by showing known patterns to the camera and analyzing the obtained images. An optimization process will then determine the optimal parameter values that explain the observations. This is a complex process that has been made easy by the availability of OpenCV calibration functions.

How to do it...
To calibrate a camera, the idea is to show it a set of scene points for which the 3D positions are known.
Then, you need to observe where these points project on the image. With the knowledge of a sufficient number of 3D points and associated 2D image points, the exact camera parameters can be inferred from the projective equation. Obviously, for accurate results, we need to observe as many points as possible. One way to achieve this would be to take a picture of a scene with known 3D points, but in practice, this is rarely feasible. A more convenient way is to take several images of a set of 3D points from different viewpoints. This approach is simpler, but it requires you to compute the position of each camera view in addition to the computation of the internal camera parameters, which is fortunately feasible. OpenCV proposes that you use a chessboard pattern to generate the set of 3D scene points required for calibration. This pattern creates points at the corners of each square, and since this pattern is flat, we can freely assume that the board is located at Z=0, with the X and Y axes well-aligned with the grid. In this case, the calibration process simply consists of showing the chessboard pattern to the camera from different viewpoints. The following is an example of a calibration pattern image made of 7x5 inner corners as captured during the calibration step: The good thing is that OpenCV has a function that automatically detects the corners of this chessboard pattern. You simply provide an image and the size of the chessboard used (the number of horizontal and vertical inner corner points). The function will return the position of these chessboard corners on the image. If the function fails to find the pattern, then it simply returns false, as shown: //output vectors of image points std::vector<cv::Point2f> imageCorners; //number of inner corners on the chessboard cv::Size boardSize(7,5); //Get the chessboard corners bool found = cv::findChessboardCorners(image, // image of chessboard pattern boardSize, // size of pattern imageCorners); // list of detected corners The output parameter, imageCorners, will simply contain the pixel coordinates of the detected inner corners of the shown pattern. Note that this function accepts additional parameters if you need to tune the algorithm, which are not discussed here. There is also a special function that draws the detected corners on the chessboard image, with lines connecting them in a sequence: //Draw the corners cv::drawChessboardCorners(image, boardSize, imageCorners, found); // corners have been found The following image is obtained: The lines that connect the points show the order in which the points are listed in the vector of detected image points. To perform a calibration, we now need to specify the corresponding 3D points. You can specify these points in the units of your choice (for example, in centimeters or in inches); however, the simplest is to assume that each square represents one unit. In that case, the coordinates of the first point would be (0,0,0) (assuming that the board is located at a depth of Z=0), the coordinates of the second point would be (1,0,0), and so on, the last point being located at (6,4,0). There are a total of 35 points in this pattern, which is too less to obtain an accurate calibration. To get more points, you need to show more images of the same calibration pattern from various points of view. To do so, you can either move the pattern in front of the camera or move the camera around the board; from a mathematical point of view, this is completely equivalent. 
The OpenCV calibration function assumes that the reference frame is fixed on the calibration pattern and will calculate the rotation and translation of the camera with respect to the reference frame. Let's now encapsulate the calibration process in a CameraCalibrator class. The attributes of this class are as follows:

// input points:
// the points in world coordinates
// (each square is one unit)
std::vector<std::vector<cv::Point3f>> objectPoints;
// the image point positions in pixels
std::vector<std::vector<cv::Point2f>> imagePoints;
// output Matrices
cv::Mat cameraMatrix;
cv::Mat distCoeffs;
// flag to specify how calibration is done
int flag;

Note that the input vectors of the scene and image points are in fact made of std::vector of point instances; each vector element is a vector of the points from one view. Here, we decided to add the calibration points by specifying a vector of chessboard image filenames as input; the method will take care of extracting the point coordinates from the images:

// Open chessboard images and extract corner points
int CameraCalibrator::addChessboardPoints(
    const std::vector<std::string>& filelist, // list of filenames
    cv::Size & boardSize) {                   // calibration board size

  // the points on the chessboard
  std::vector<cv::Point2f> imageCorners;
  std::vector<cv::Point3f> objectCorners;

  // 3D Scene Points:
  // Initialize the chessboard corners
  // in the chessboard reference frame
  // The corners are at 3D location (X,Y,Z)= (i,j,0)
  for (int i=0; i<boardSize.height; i++) {
    for (int j=0; j<boardSize.width; j++) {
      objectCorners.push_back(cv::Point3f(i, j, 0.0f));
    }
  }

  // 2D Image points:
  cv::Mat image; // to contain chessboard image
  int successes = 0;
  // for all viewpoints
  for (int i=0; i<filelist.size(); i++) {
    // Open the image
    image = cv::imread(filelist[i], 0);
    // Get the chessboard corners
    bool found = cv::findChessboardCorners(image,         // image of chessboard pattern
                                           boardSize,     // size of pattern
                                           imageCorners); // list of detected corners
    // Get subpixel accuracy on the corners
    if (found) {
      cv::cornerSubPix(image, imageCorners,
        cv::Size(5, 5),   // half size of search window
        cv::Size(-1, -1),
        cv::TermCriteria(cv::TermCriteria::MAX_ITER + cv::TermCriteria::EPS,
                         30,     // max number of iterations
                         0.1));  // min accuracy
      // If we have a good board, add it to our data
      if (imageCorners.size() == boardSize.area()) {
        // Add image and scene points from one view
        addPoints(imageCorners, objectCorners);
        successes++;
      }
    }
  }
  return successes;
}

The first loop inputs the 3D coordinates of the chessboard and the corresponding image points are the ones provided by the cv::findChessboardCorners function; this is done for all the available viewpoints. Moreover, in order to obtain a more accurate image point location, the cv::cornerSubPix function can be used, and as the name suggests, the image points will then be localized at subpixel accuracy. The termination criterion that is specified by the cv::TermCriteria object defines the maximum number of iterations and the minimum accuracy in subpixel coordinates. The first of these two conditions that is reached will stop the corner refinement process. When a set of chessboard corners has been successfully detected, these points are added to the vectors of the image and scene points using our addPoints method.
Once a sufficient number of chessboard images have been processed (and consequently, a large number of 3D scene point / 2D image point correspondences are available), we can initiate the computation of the calibration parameters as shown: // Calibrate the camera // returns the re-projection error double CameraCalibrator::calibrate(cv::Size &imageSize){ //Output rotations and translations std::vector<cv::Mat> rvecs, tvecs; // start calibration return calibrateCamera(objectPoints, // the 3D points imagePoints, // the image points imageSize, // image size cameraMatrix, // output camera matrix distCoeffs, // output distortion matrix rvecs, tvecs, // Rs, Ts flag); // set options } In practice, 10 to 20 chessboard images are sufficient, but these must be taken from different viewpoints at different depths. The two important outputs of this function are the camera matrix and the distortion parameters. These will be described in the next section. How it works... In order to explain the result of the calibration, we need to go back to the projective equation presented in the introduction of this article. This equation describes the transformation of a 3D point into a 2D point through the successive application of two matrices. The first matrix includes all of the camera parameters, which are called the intrinsic parameters of the camera. This 3x3 matrix is one of the output matrices returned by the cv::calibrateCamera function. There is also a function called cv::calibrationMatrixValues that explicitly returns the value of the intrinsic parameters given by a calibration matrix. The second matrix is there to have the input points expressed into camera-centric coordinates. It is composed of a rotation vector (a 3x3 matrix) and a translation vector (a 3x1 matrix). Remember that in our calibration example, the reference frame was placed on the chessboard. Therefore, there is a rigid transformation (made of a rotation component represented by the matrix entries r1 to r9 and a translation represented by t1, t2, and t3) that must be computed for each view. These are in the output parameter list of the cv::calibrateCamera function. The rotation and translation components are often called the extrinsic parameters of the calibration and they are different for each view. The intrinsic parameters remain constant for a given camera/lens system. The calibration results provided by the cv::calibrateCamera are obtained through an optimization process. This process aims to find the intrinsic and extrinsic parameters that minimizes the difference between the predicted image point position, as computed from the projection of the 3D scene points, and the actual image point position, as observed on the image. The sum of this difference for all the points specified during the calibration is called the re-projection error. The intrinsic parameters of our test camera obtained from a calibration based on the 27 chessboard images are fx=409 pixels; fy=408 pixels; u0=237; and v0=171. Our calibration images have a size of 536x356 pixels. From the calibration results, you can see that, as expected, the principal point is close to the center of the image, but yet off by few pixels. The calibration images were taken using a Nikon D500 camera with an 18mm lens. Looking at the manufacturer specifitions, we find that the sensor size of this camera is 23.5mm x 15.7mm which gives us a pixel size of 0.0438mm. 
The estimated focal length is expressed in pixels, so multiplying the result by the pixel size gives us an estimated focal length of 17.8mm, which is consistent with the actual lens we used. Let us now turn our attention to the distortion parameters. So far, we have mentioned that under the pin-hole camera model, we can neglect the effect of the lens. However, this is only possible if the lens that is used to capture an image does not introduce important optical distortions. Unfortunately, this is not the case with lower quality lenses or with lenses that have a very short focal length. Even the lens we used in this experiment introduced some distortion, that is, the edges of the rectangular board are curved in the image. Note that this distortion becomes more important as we move away from the center of the image. This is a typical distortion observed with a fish-eye lens and is called radial distortion. It is possible to compensate for these deformations by introducing an appropriate distortion model. The idea is to represent the distortions induced by a lens by a set of mathematical equations. Once established, these equations can then be reverted in order to undo the distortions visible on the image. Fortunately, the exact parameters of the transformation, which will correct the distortions, can be obtained together with the other camera parameters during the calibration phase. Once this is done, any image from the newly calibrated camera will be undistorted. Therefore, we have added an additional method to our calibration class. //remove distortion in an image (after calibration) cv::Mat CameraCalibrator::remap(const cv::Mat &image) { cv::Mat undistorted; if (mustInitUndistort) { //called once per calibration cv::initUndistortRectifyMap(cameraMatrix, // computed camera matrix distCoeffs, // computed distortion matrix cv::Mat(), // optional rectification (none) cv::Mat(), // camera matrix to generate undistorted image.size(), // size of undistorted CV_32FC1, // type of output map map1, map2); // the x and y mapping functions mustInitUndistort= false; } // Apply mapping functions cv::remap(image, undistorted, map1, map2, cv::INTER_LINEAR); // interpolation type return undistorted; } Running this code on one of our calibration image results in the following undistorted image: To correct the distortion, OpenCV uses a polynomial function that is applied to the image points in order to move them at their undistorted position. By default, five coefficients are used; a model made of eight coefficients is also available. Once these coefficients are obtained, it is possible to compute two cv::Mat mapping functions (one for the x coordinate and one for the y coordinate) that will give the new undistorted position of an image point on a distorted image. This is computed by the cv::initUndistortRectifyMap function, and the cv::remap function remaps all the points of an input image to a new image. Note that because of the nonlinear transformation, some pixels of the input image now fall outside the boundary of the output image. You can expand the size of the output image to compensate for this loss of pixels, but you now obtain output pixels that have no values in the input image (they will then be displayed as black pixels). There‘s more... More options are available when it comes to camera calibration. Calibration with known intrinsic parameters When a good estimate of the camera’s intrinsic parameters is known, it could be advantageous to input them in the cv::calibrateCamera function. 
They will then be used as initial values in the optimization process. To do so, you just need to add the cv::CALIB_USE_INTRINSIC_GUESS flag and input these values in the calibration matrix parameter. It is also possible to impose a fixed value for the principal point (cv::CALIB_FIX_PRINCIPAL_POINT), which can often be assumed to be the central pixel. You can also impose a fixed ratio for the focal lengths fx and fy (cv::CALIB_FIX_RATIO); in which case, you assume that the pixels have a square shape. Using a grid of circles for calibration Instead of the usual chessboard pattern, OpenCV also offers the possibility to calibrate a camera by using a grid of circles. In this case, the centers of the circles are used as calibration points. The corresponding function is very similar to the function we used to locate the chessboard corners, for example: cv::Size boardSize(7,7); std::vector<cv::Point2f> centers; bool found = cv:: findCirclesGrid(image, boardSize, centers); See also The A flexible new technique for camera calibration article by Z. Zhang  in IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 22, no 11, 2000, is a classic paper on the problem of camera calibration Recovering camera pose When a camera is calibrated, it becomes possible to relate the captured with the outside world. If the 3D structure of an object is known, then one can predict how the object will be imaged on the sensor of the camera. The process of image formation is indeed completely described by the projective equation that was presented at the beginning of this article. When most of the terms of this equation are known, it becomes possible to infer the value of the other elements (2D or 3D) through the observation of some images. In this recipe, we will look at the camera pose recovery problem when a known 3D structure is observed. How to do it... Lets consider a simple object here, a bench in a park. We took an image of it using the camera/lens system calibrated in the previous recipe. We have manually identified 8 distinct image points on the bench that we will use for our camera pose estimation. Having access to this object makes it possible to make some physical measurements. The bench is composed of a seat of size 242.5cmx53.5cmx9cm and a back of size 242.5cmx24cmx9cm that is fixed 12cm over the seat. Using this information, we can then easily derive the 3D coordinates of the eight identified points in an object-centric reference frame (here we fixed the origin at the left extremity of the intersection between the two planes). We can then create a vector of cv::Point3f containing these coordinates. //Input object points std::vector<cv::Point3f> objectPoints; objectPoints.push_back(cv::Point3f(0, 45, 0)); objectPoints.push_back(cv::Point3f(242.5, 45, 0)); objectPoints.push_back(cv::Point3f(242.5, 21, 0)); objectPoints.push_back(cv::Point3f(0, 21, 0)); objectPoints.push_back(cv::Point3f(0, 9, -9)); objectPoints.push_back(cv::Point3f(242.5, 9, -9)); objectPoints.push_back(cv::Point3f(242.5, 9, 44.5)); objectPoints.push_back(cv::Point3f(0, 9, 44.5)); The question now is where the camera was with respect to these points when the shown picture was taken. Since the coordinates of the image of these known points on the 2D image plane are also known, it becomes easy to answer this question using the cv::solvePnP function. 
Here, the correspondence between the 3D and the 2D points has been established manually, but as a reader of this book, you should be able to come up with methods that would allow you to obtain this information automatically. //Input image points std::vector<cv::Point2f> imagePoints; imagePoints.push_back(cv::Point2f(136, 113)); imagePoints.push_back(cv::Point2f(379, 114)); imagePoints.push_back(cv::Point2f(379, 150)); imagePoints.push_back(cv::Point2f(138, 135)); imagePoints.push_back(cv::Point2f(143, 146)); imagePoints.push_back(cv::Point2f(381, 166)); imagePoints.push_back(cv::Point2f(345, 194)); imagePoints.push_back(cv::Point2f(103, 161)); // Get the camera pose from 3D/2D points cv::Mat rvec, tvec; cv::solvePnP(objectPoints, imagePoints, // corresponding 3D/2D pts cameraMatrix, cameraDistCoeffs, // calibration rvec, tvec); // output pose // Convert to 3D rotation matrix cv::Mat rotation; cv::Rodrigues(rvec, rotation); This function computes the rigid transformation (rotation and translation) that brings the object coordinates in the camera-centric reference frame (that is, the ones that has its origin at the focal point). It is also important to note that the rotation computed by this function is given in the form of a 3D vector. This is a compact representation in which the rotation to apply is described by a unit vector (an axis of rotation) around which the object is rotated by a certain angle. This axis-angle representation is also called the Rodrigues’ rotation formula. In OpenCV, the angle of rotation corresponds to the norm of the output rotation vector, which is later aligned with the axis of rotation. This is why the cv::Rodrigues function is used to obtain the 3D matrix of rotation that appears in our projective equation. The pose recovery procedure described here is simple, but how do we know we obtained the right camera/object pose information. We can visually assess the quality of the results using the cv::viz module that gives us the ability to visualize 3D information. The use of this module is explained in the last section of this recipe. For now, lets display a simple 3D representation of our object and the camera that captured it: It might be difficult to judge of the quality of the pose recovery just by looking at this image but if you test the example of this recipe on your computer, you will have the possibility to move this representation in 3D using your mouse which should give you a better sense of the solution obtained. How it works... In this recipe, we assumed that the 3D structure of the object was known as well as the correspondence between sets of object points and image points. The camera’s intrinsic parameters were also known through calibration. If you look at our projective equation presented at the end of the Digital image formation section of the introduction of this article, this means that we have points for which coordinates (X,Y,Z) and (x,y) are known. We also have the elements of first matrix known (the intrinsic parameters). Only the second matrix is unknown; this is the one that contains the extrinsic parameters of the camera that is the camera/object pose information. Our objective is to recover these unknown parameters from the observation of 3D scene points. This problem is known as the Perspective-n-Point problem or PnP problem. Rotation has three degrees of freedom (for example, angle of rotation around the three axes) and translation also has three degrees of freedom. We therefore have a total of 6 unknowns. 
For each object point/image point correspondence, the projective equation gives us three algebraic equations but since the projective equation is up to a scale factor, we only have 2 independent equations. A minimum of three points is therefore required to solve this system of equations. Obviously, more points provide a more reliable estimate. In practice, many different algorithms have been proposed to solve this problem and OpenCV proposes a number of different implementation in its cv::solvePnP function. The default method consists in optimizing what is called the reprojection error. Minimizing this type of error is considered to be the best strategy to get accurate 3D information from camera images. In our problem, it corresponds to finding the optimal camera position that minimizes the 2D distance between the projected 3D points (as obtained by applying the projective equation) and the observed image points given as input. Note that OpenCV also has a cv::solvePnPRansac function. As the name suggest, this function uses the RANSAC algorithm in order to solve the PnP problem. This means that some of the object points/image points correspondences may be wrong and the function will returns which ones have been identified as outliers. This is very useful when these correspondences have been obtained through an automatic process that can fail for some points. There‘s more... When working with 3D information, it is often difficult to validate the solutions obtained. To this end, OpenCV offers a simple yet powerful visualization module that facilitates the development and debugging of 3D vision algorithms. It allows inserting points, lines, cameras, and other objects in a virtual 3D environment that you can interactively visualize from various points of views. cv::Viz, a 3D Visualizer module cv::Viz is an extra module of the OpenCV library that is built on top of the VTK open source library. This Visualization Tooolkit (VTK) is a powerful framework used for 3D computer graphics. With cv::viz, you create a 3D virtual environment to which you can add a variety of objects. A visualization window is created that displays the environment from a given point of view. You saw in this recipe an example of what can be displayed in a cv::viz window. This window responds to mouse events that are used to navigate inside the environment (through rotations and translations). This section describes the basic use of the cv::viz module. The first thing to do is to create the visualization window. Here we use a white background: // Create a viz window cv::viz::Viz3d visualizer(“Viz window“); visualizer.setBackgroundColor(cv::viz::Color::white()); Next, you create your virtual objects and insert them into the scene. There is a variety of predefined objects. One of them is particularly useful for us; it is the one that creates a virtual pin-hole camera: // Create a virtual camera cv::viz::WCameraPosition cam(cMatrix, // matrix of intrinsics image, // image displayed on the plane 30.0, // scale factor cv::viz::Color::black()); // Add the virtual camera to the environment visualizer.showWidget(“Camera“, cam); The cMatrix variable is a cv::Matx33d (that is,a cv::Matx<double,3,3>) instance containing the intrinsic camera parameters as obtained from calibration. By default this camera is inserted at the origin of the coordinate system. To represent the bench, we used two rectangular cuboid objects. 
// Create a virtual bench from cuboids cv::viz::WCube plane1(cv::Point3f(0.0, 45.0, 0.0), cv::Point3f(242.5, 21.0, -9.0), true, // show wire frame cv::viz::Color::blue()); plane1.setRenderingProperty(cv::viz::LINE_WIDTH, 4.0); cv::viz::WCube plane2(cv::Point3f(0.0, 9.0, -9.0), cv::Point3f(242.5, 0.0, 44.5), true, // show wire frame cv::viz::Color::blue()); plane2.setRenderingProperty(cv::viz::LINE_WIDTH, 4.0); // Add the virtual objects to the environment visualizer.showWidget(“top“, plane1); visualizer.showWidget(“bottom“, plane2); This virtual bench is also added at the origin; it then needs to be moved at its camera-centric position as found from our cv::solvePnP function. It is the responsibility of the setWidgetPose method to perform this operation. This one simply applies the rotation and translation components of the estimated motion. cv::Mat rotation; // convert vector-3 rotation // to a 3x3 rotation matrix cv::Rodrigues(rvec, rotation); // Move the bench cv::Affine3d pose(rotation, tvec); visualizer.setWidgetPose(“top“, pose); visualizer.setWidgetPose(“bottom“, pose); The final step is to create a loop that keeps displaying the visualization window. The 1ms pause is there to listen to mouse events. // visualization loop while(cv::waitKey(100)==-1 && !visualizer.wasStopped()) { visualizer.spinOnce(1, // pause 1ms true); // redraw } This loop will stop when the visualization window is closed or when a key is pressed over an OpenCV image window. Try to apply inside this loop some motion on an object (using setWidgetPose); this is how animation can be created. See also Model-based object pose in 25 lines of code by D. DeMenthon and L. S. Davis, in European Conference on Computer Vision, 1992, pp.335–343 is a famous method for recovering camera pose from scene points. Summary This article teaches us how, under specific conditions, the 3D structure of the scene and the 3D pose of the cameras that captured it can be recovered. We have seen how a good understanding of projective geometry concepts allows to devise methods enabling 3D reconstruction. Resources for Article: Further resources on this subject: OpenCV: Image Processing using Morphological Filters [article] Learn computer vision applications in Open CV [article] Cardboard is Virtual Reality for Everyone [article]

Solving an NLP Problem with Keras, Part 2

Sasank Chilamkurthy
13 Oct 2016
6 min read
In this two-part post series, we are solving a Natural Language Processing (NLP) problem with Keras. In Part 1, we covered the problem and the ATIS dataset we are using. We also went over word embeddings (mapping words to a vector) along with Recurrent Neural Networks that solve complicated word tagging problems. We passed the word embedding sequence as input into the RNN and we then started coding that up. Now, it is time in this post to start loading the data.

Loading Data
Let's load the data using data.load.atisfull(). It will download the data the first time it is run. Words and labels are encoded as indexes to a vocabulary. This vocabulary is stored in w2idx and labels2idx.

import numpy as np
import data.load

train_set, valid_set, dicts = data.load.atisfull()
w2idx, labels2idx = dicts['words2idx'], dicts['labels2idx']

train_x, _, train_label = train_set
val_x, _, val_label = valid_set

# Create index to word/label dicts
idx2w = {w2idx[k]:k for k in w2idx}
idx2la = {labels2idx[k]:k for k in labels2idx}

# For conlleval script
words_train = [ list(map(lambda x: idx2w[x], w)) for w in train_x]
labels_train = [ list(map(lambda x: idx2la[x], y)) for y in train_label]
words_val = [ list(map(lambda x: idx2w[x], w)) for w in val_x]
labels_val = [ list(map(lambda x: idx2la[x], y)) for y in val_label]

n_classes = len(idx2la)
n_vocab = len(idx2w)

Let's print an example sentence and label:

print("Example sentence : {}".format(words_train[0]))
print("Encoded form: {}".format(train_x[0]))
print()
print("It's label : {}".format(labels_train[0]))
print("Encoded form: {}".format(train_label[0]))

Here is the output:

Example sentence : ['i', 'want', 'to', 'fly', 'from', 'boston', 'at', 'DIGITDIGITDIGIT', 'am', 'and', 'arrive', 'in', 'denver', 'at', 'DIGITDIGITDIGITDIGIT', 'in', 'the', 'morning']
Encoded form: [232 542 502 196 208 77 62 10 35 40 58 234 137 62 11 234 481 321]

It's label : ['O', 'O', 'O', 'O', 'O', 'B-fromloc.city_name', 'O', 'B-depart_time.time', 'I-depart_time.time', 'O', 'O', 'O', 'B-toloc.city_name', 'O', 'B-arrive_time.time', 'O', 'O', 'B-arrive_time.period_of_day']
Encoded form: [126 126 126 126 126 48 126 35 99 126 126 126 78 126 14 126 126 12]

Keras model
Next, we define the Keras model. Keras has an inbuilt Embedding layer for word embeddings. It expects integer indices. SimpleRNN is the recurrent neural network layer described in Part 1. We will have to use TimeDistributed to pass the output of the RNN, o_t, at each time step t to a fully connected layer. Otherwise, only the output at the final time step will be passed on to the next layer.

from keras.models import Sequential
from keras.layers.embeddings import Embedding
from keras.layers.recurrent import SimpleRNN
from keras.layers.core import Dense, Dropout
from keras.layers.wrappers import TimeDistributed
from keras.layers import Convolution1D

model = Sequential()
model.add(Embedding(n_vocab, 100))
model.add(Dropout(0.25))
model.add(SimpleRNN(100, return_sequences=True))
model.add(TimeDistributed(Dense(n_classes, activation='softmax')))
model.compile('rmsprop', 'categorical_crossentropy')

Training
Now, let's start training our model. We will pass each sentence as a batch to the model. We cannot use model.fit() because it expects all of the sentences to be the same size. We will therefore use model.train_on_batch(). Training is very fast, since the dataset is relatively small. Each epoch takes 20 seconds on my Macbook Air.
import progressbar
n_epochs = 30

for i in range(n_epochs):
    print("Training epoch {}".format(i))
    bar = progressbar.ProgressBar(max_value=len(train_x))
    for n_batch, sent in bar(enumerate(train_x)):
        label = train_label[n_batch]
        # Make labels one hot
        label = np.eye(n_classes)[label][np.newaxis, :]
        # View each sentence as a batch
        sent = sent[np.newaxis, :]
        if sent.shape[1] > 1:  # ignore 1 word sentences
            model.train_on_batch(sent, label)

Evaluation
To measure the accuracy of the model, we use model.predict_on_batch() and metrics.accuracy.conlleval().

from metrics.accuracy import conlleval

labels_pred_val = []
bar = progressbar.ProgressBar(max_value=len(val_x))
for n_batch, sent in bar(enumerate(val_x)):
    label = val_label[n_batch]
    label = np.eye(n_classes)[label][np.newaxis, :]
    sent = sent[np.newaxis, :]

    pred = model.predict_on_batch(sent)
    pred = np.argmax(pred, -1)[0]
    labels_pred_val.append(pred)

labels_pred_val = [ list(map(lambda x: idx2la[x], y)) for y in labels_pred_val]
con_dict = conlleval(labels_pred_val, labels_val, words_val, 'measure.txt')

print('Precision = {}, Recall = {}, F1 = {}'.format(
      con_dict['r'], con_dict['p'], con_dict['f1']))

With this model, I get a 92.36 F1 Score:

Precision = 92.07, Recall = 92.66, F1 = 92.36

Note that for the sake of brevity, I've not shown the logging part of the code. Logging losses and accuracies is an important part of coding up a model. An improved model (described in the next section) with logging is at main.py. You can run it as:

$ python main.py

Improvements
One drawback of our current model is that there is no lookahead; that is, the output o_t depends only on the current and previous words, but not on the words that come next. You can imagine that the next word also holds clues about the properties of the current word. Lookahead can easily be implemented by having a convolutional layer between the word embeddings and the RNN:

from keras.layers.recurrent import GRU  # GRU needs to be imported in addition to the earlier imports

model = Sequential()
model.add(Embedding(n_vocab, 100))
model.add(Convolution1D(128, 5, border_mode='same', activation='relu'))
model.add(Dropout(0.25))
model.add(GRU(100, return_sequences=True))
model.add(TimeDistributed(Dense(n_classes, activation='softmax')))
model.compile('rmsprop', 'categorical_crossentropy')

With this improved model, I get a 94.90 F1 Score!

Conclusion
In this two-part post series, you learned about word embeddings and RNNs. We applied these to an NLP problem: ATIS. We also made an improvement to our model. To improve the model further, you can try using word embeddings learned on a large corpus such as Wikipedia. Also, there are variants of RNNs such as LSTM or GRU that can be experimented with.

About the author
Sasank Chilamkurthy works at Fractal Analytics. His work involves deep learning on medical images obtained from radiology and pathology. He is mainly interested in computer vision.


Spark for Beginners

Packt
13 Oct 2016
30 min read
In this article by Rajanarayanan Thottuvaikkatumana, author of the book Apache Spark 2 for Beginners, you will get an overview of Spark. By exampledata is one of the most important assets of any organization. The scale at which data is being collected and used in organizations is growing beyond imagination. The speed at which data is being ingested, the variety of the data types in use, and the amount of data that is being processed and stored are breaking all time records every moment. It is very common these days, even in small scale organizations, the data is growing from gigabytes to terabytes to petabytes. Just because of the same reason, the processing needs are also growing that asks for capability to process data at rest as well as data on the move. (For more resources related to this topic, see here.) Take any organization, its success depends on the decisions made by its leaders and for taking sound decisions, you need the backing of good data and the information generated by processing the data. This poses a big challenge on how to process the data in a timely and cost-effective manner so that right decisions can be made. Data processing techniques have evolved since the early days of computers. Countless data processing products and frameworks came into the market and disappeared over these years. Most of these data processing products and frameworks were not general purpose in nature. Most of the organizations relied on their own bespoke applications for their data processing needs in a silo way or in conjunction with specific products. Large-scale Internet applications popularly known as Internet of Things (IoT) applications heralded the common need to have open frameworks to process huge amount of data ingested at great speed dealing with various types of data. Large scale websites, media streaming applications, and huge batch processing needs of the organizations made the need even more relevant. The open source community is also growing considerably along with the growth of Internet delivering production quality software supported by reputed software companies. A huge number of companies started using open source software and started deploying them in their production environments. Apache Spark Spark is a Java Virtual Machine (JVM) based distributed data processing engine that scales, and it is fast as compared to many other data processing frameworks. Spark was born out of University of California, Berkeley, and later became one of the top projects in Apache. The research paper Mesos: A Platform for Fine-Grained Resource Sharing in the Data Center talks about the philosophy behind the design of Spark. The research paper says: "To test the hypothesis that simple specialized frameworks provide value, we identified one class of jobs that were found to perform poorly on Hadoop by machine learning researchers at our lab: iterative jobs, where a dataset is reused across a number of iterations. We built a specialized framework called Spark optimized for these workloads." The biggest claim from Spark on the speed is Run programs up to 100x faster than Hadoop MapReduce in memory, or 10x faster on disk. Spark could make this claim because Spark does the processing in the main memory of the worker nodes andprevents the unnecessary I/O operations with the disks. The other advantage Spark offers is the ability to chain the tasks even at an application programming level without writing onto the disks at all or minimizing the number of writes to the disks. 
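To make the chaining idea concrete, here is a minimal PySpark sketch. It assumes a SparkContext named sc (as provided by the pyspark shell) and a hypothetical input file; it illustrates the pattern only and is not code from the book.

# Chain transformations in memory; nothing touches the disk between steps.
rdd = sc.textFile("data/transactions.txt")
amounts = (rdd.map(lambda line: line.split(","))        # parse each record
              .filter(lambda fields: len(fields) == 3)  # drop malformed rows
              .map(lambda fields: float(fields[2])))    # keep the amount column
total = amounts.sum()   # the chained steps only execute when this action runs
print(total)

The chained map and filter calls are lazily evaluated, which is what allows Spark to keep the intermediate results in memory rather than writing them out between stages.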
The Spark programming paradigm is very powerful and exposes a uniform programming model supporting the application development in multiple programming languages. Spark supports programming in Scala, Java, Python, and R even though there is no functional parity across all the programming languages supported. Apart from writing Spark applications in these programming languages, Spark has an interactive shell with Read, Evaluate, Print, and Loop (REPL) capabilities for the programming languages Scala, Python, and R. At this moment, there is no REPL support for Java in Spark. The Spark REPL is a very versatile tool that can be used to try and test Spark application code in an interactive fashion. The Spark REPL enables easy prototyping, debugging, and much more. In addition to the core data processing engine, Spark comes with a powerful stack of domain-specific libraries that use the core Spark libraries and provide various functionalities useful for various big data processing needs. The following list gives the supported libraries: Library Use Supported Languages Spark SQL Enables the use of SQL statements or DataFrame API inside Spark applications Scala, Java, Python, and R Spark Streaming Enables processing of live data streams Scala, Java, and Python Spark MLlib Enables development of machine learning applications Scala, Java, Python, and R Spark GraphX Enables graph processing and supports a growing library of graph algorithms Scala Understanding the Spark programming model Spark became an instant hit in the market because of its ability to process a huge amount of data types and growing number of data sources and data destinations. The most important and basic data abstraction Spark provides is the resilient distributed dataset (RDD). Spark supports distributed processing on a cluster of nodes. The moment there is a cluster of nodes, there are good chances that when the data processing is going on, some of the nodes can die. When such failures happen, the framework should be capable of coming out of such failures. Spark is designed to do that and that is what the resilient part in the RDD signifies. If there is a huge amount of data to be processed and there are nodes available in the cluster, the framework should have the capability to split the big dataset into smaller chunks and distribute them to be processed on more than one node in a cluster in parallel. Spark is capable of doing that and that is what the distributed part in the RDD signifies. In other words, Spark is designed from ground up to have its basic dataset abstraction capable of getting split into smaller pieces deterministically and distributed to more than one nodes in the cluster for parallel processing while elegantly handling the failures in the nodes. Spark RDD is immutable. Once an RDD is created, intentionally or unintentionally, it cannot be changed. This gives another insight into the construction of an RDD. There are some strong rules based on which an RDD is created. Because of that, when the nodes processing some part of an RDD die, the driver program can recreate those parts and assign the task of processing it to another node and ultimately completing the data processing job successfully. Since the RDD is immutable, splitting a big one to smaller ones, distributing them to various worker nodes for processing and finally compiling the results to produce the final result can be done safely without worrying about the underlying data getting changed. Spark RDD is distributable. 
If Spark is run in a cluster mode where there are multiple worker nodes available to take the tasks, all these nodes are having different execution contexts. The individual tasks are distributed and run on different JVMs. All these activities of a big RDD getting divided into smaller chunks, getting distributed for processing to the worker nodes and finally assembling the results back are completely hidden from the users. Spark has its on mechanism from recovering from the system faults and other forms of errors happening during the data processing.Hence this data abstraction is highly resilient. Spark RDD lives in memory (most of the time). Spark does keep all the RDDs in the memory as much as it can. Only in rare situations where Spark is running out of memory or if the data size is growing beyond the capacity, it is written into disk. Most of the processing on RDD happens in the memory and that is the reason why Spark is able to process the data in a lightning fast speed. Spark RDD is strongly typed. Spark RDD can be created using any supported data types. These data types can be Scala/Java supported intrinsic data types or custom created data types such as your own classes. The biggest advantage coming out of this design decision is the freedom from runtime errors. If it is going to break because of a data type issue, it will break during the compile time. Spark does the data processing using the RDDs. From the relevant data source such as text files, and NoSQL data stores, data is read to form the RDDs. On such an RDD, various data transformations are performed and finally the result is collected. To be precise, Spark comes with Spark Transformations and Spark Actions that act upon RDDs.Whenever a Spark Transformation is done on an RDD, a new RDD gets created. This is because RDDs are inherently immutable. These RDDs that are getting created at the end of each Spark Transformation can be saved for future reference or they will go out of scope eventually. The Spark Actions are used to return the computed values to the driver program. The process of creating one or more RDDs, apply transformations and actions on them is a very common usage pattern seen ubiquitously in Spark applications. Spark SQL Spark SQL is a library built on top of Spark. It exposes SQL interface, and DataFrame API. DataFrame API supports programming languages Scala, Java, Python and R. In programming languages such as R, there is a data frame abstraction used to store data tables in memory. The Python data analysis library named Pandas also has a similar data frame concept. Once that data structure is available in memory, the programs can extract the data, slice and dice the way as per the need. The same data table concept is extended to Spark known as DataFrame built on top of RDD and there is a very comprehensive API known as DataFrame API in Spark SQL to process the data in the DataFrame. An SQL-like query language is also developed on top of the DataFrame abstraction catering to the needs of the end users to query and process the underlying structured data. In summary, DataFrame is a distributed data table organized in rows and columns having names for each column. There is no doubt that SQL is the lingua franca for doing data analysis and Spark SQL is the answer from the Spark family of toolsets to do data analysis. So what it provides? It provides the ability to run SQL on top of Spark. 
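As a minimal sketch of what running SQL on top of Spark looks like, the following assumes a Spark 2 SparkSession named spark and a hypothetical flights dataset with origin, dest, and price columns; it shows the general pattern only.

# Read structured data into a DataFrame and query it two ways.
df = spark.read.json("data/flights.json")

# DataFrame API
df.select("origin", "dest", "price").filter(df["price"] < 300).show()

# SQL on the same data
df.createOrReplaceTempView("flights")
spark.sql("SELECT origin, COUNT(*) AS n FROM flights GROUP BY origin").show()

Both forms are compiled to the same optimized execution plan, so choosing between them is largely a matter of taste.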
Whether the data is coming from CSV, Avro, Parquet, Hive, NoSQL data stores such as Cassandra, or even RDBMS, Spark SQL can be used to analyze data and mix in with Spark programs. Many of the data sources mentioned here are supported intrinsically by Spark SQL and many others are supported by external packages. The most important aspect to highlight here is the ability of Spark SQL to deal with data from a very wide variety of data sources.Once it is available as a DataFrame in Spark, Spark SQL can process them in a completely distributed way, combine the DataFrames coming from various data sources to process, and query as if the entire dataset is coming from a single source. In the previous section, the RDD was discussed and introduced as the Spark programming model. Are the DataFrames API and the usage of SQL dialects in Spark SQL replacing RDD-based programming model? Definitely not! The RDD-based programming model is the generic and the basic data processing model in Spark. RDD-based programming requires the need to use real programming techniques. The Spark Transformations and Spark Actions use a lot of functional programming constructs. Even though the amount code that is required to be written in RDD-based programming model is less as compared to Hadoop MapReduce or any other paradigm, still there is a need to write some amount of functional code. The is is a barrier to entry enter for many data scientists, data analysts, and business analysts who may perform major exploratory kind of data analysis or doing some prototyping with the data. Spark SQL completely removes those constraints. Simple and easy-to-use domain-specific language (DSL) based methods to read and write data from data sources, SQL-like language to select, filter, aggregate, and capability to read data from a wide variety of data sources makes it easy for anybody who knows the data structure to use it. Which is the best use case to use RDD and which is the best use case to use Spark SQL? The answer is very simple. If the data is structured, it can be arranged in tables, and if each column can be given a name, then use Spark SQL. This doesn't mean that the RDD and DataFrame are two disparate entities. They interoperate very well. Conversions from RDD to DataFrame and vice versa are very much possible. Many of the Spark Transformations and Spark Actions that are typically applied on RDDs can also be applied on DataFrames. Interaction with Spark SQL library is done mainly through two methods. One is through SQL-like queries and the other is through DataFrame API. The Spark programming paradigm has many abstractions to choose from when it comes to developing data processing applications. The fundamentals of Spark programming starts with RDDs that can easily deal with unstructured, semi-structured, and structured data. The Spark SQL library offers highly optimized performance when processing structured data. This makes the basic RDDs look inferior in terms of performance. To fill this gap, from Spark 1.6 onwards, a new abstraction named Dataset was introduced that compliments the RDD-based Spark programming model. It works pretty much the same way as RDD when it comes to Spark Transformations and Spark Actions at the same time it is highly optimized like the Spark SQL. Dataset API provides strong compile-time type safety when it comes to writing programs and because of that the Dataset API is available only in Scala and Java. Too many choices confuses everybody. 
Here in the Spark programming model also the same problem is seen. But it is not as confusing as in many other programming paradigms. Whenever there is a need to process any kind of data with very high flexibility in terms of the data processing requirements and having the lowest API level control such as library development, RDD-based programming model is ideal. Whenever there is a need to process structured data with flexibility for accessing and processing data with optimized performance across all the supported programming languages, DataFrame-based Spark SQL programming model is ideal. Whenever there is a need to process unstructured data with optimized performance requirements as well as compile-time type safety but not very complex Spark Transformations and Spark Actions usage requirements, dataset-based programming model is ideal. At a data processing application development level, if the programming language of choice permits, it is better to use Dataset and DataFrame to have better performance. R on Spark A base R installation cannot interact with Spark. The SparkR package popularly known as R on Spark exposes all the required objects, and functions for R to talk to the Spark ecosystem. As compared to Scala, Java, and Python, the Spark programming in R is different and the SparkR package mainly exposes R API for DataFrame-based Spark SQL programming. At this moment, R cannot be used to manipulate the RDDs of Spark directly. So for all practical purposes, the R API for Spark has access to only Spark SQL abstractions. How SparkR is going to help the data scientists to do better data processing? The base R installation mandates that all the data to be stored (or accessible) on the computer where R is installed. The data processing happen on the single computer on which the R installation is available. More over if the data size is more than the main memory available on the computer, R will not be able to do the required processing. With SparkR package, there is an access to a whole new world of a cluster of nodes for data storage and for carrying out data processing. With the help of SparkR package, R can be used to access the Spark DataFrames as well as R DataFrames. It is very important to have a distinction of the two types of data frames. R DataFrame is completely local and a data structure of the R language. Spark DataFrame is a parallel collection of structured data managed by the Spark infrastructure. An R DataFrame can be converted to a Spark DataFrame. A Spark DataFrame can be converted to an R DataFrame. When a Spark DataFrame is converted to R DataFrame, it should fit in the available memory of the computer. This conversion is a great feature. By converting an R DataFrame to Spark DataFrame, the data can be distributed and processed in parallel. By converting a Spark DataFrame to an R DataFrame, many computations, charting and plotting that is done by other R functions can be done. In a nutshell, the SparkR package brings in the power of distributed and parallel computing capabilities to R. Many times when doing data processing with R, because of the sheer size of the data and the need to fit it into the main memory of the computer, the data processing is done in multiple batches and the results are consolidated to compute the final results. This kind of multibatch processing can be completely avoided if Spark with R is used to process the data. Many times reporting, charting, and plotting are done on the aggregated and summarized raw data. 
The raw data size can be huge and need not fit into one computer. In such cases, Spark with R can be used to process the entire raw data and finally the aggregated and summarized data can be used to produce the reports, charts, or plots. Because of the inability to process huge amount of data and for carrying data analysis with R, many times ETL tools are made to use for doing the preprocessing or transformations on the raw data.Only in the final stage the data analysis is done using R. Because of Spark's ability to process data at scale, Spark with R can replace the entire ETL pipeline and do the desired data analysis with R. SparkR package is yet another R package but that is not stopping anybody from using any of the R packages that are already being used. At the same time, it supplements the data processing capability of R manifold by making use of the huge data processing capabilities of Spark. Spark data analysis with Python The ultimate goal of processing data is to use the results for answering business questions. It is very important to understand the data that is being used to answer the business questions. To understand the data better, various tabulation methods, charting and plotting techniques are used. Visual representation of the data reinforces the understanding of the underlying data. Because of this, data visualization is used extensively in data analysis. There are different terms that are being used in various publications to mean the analysis of data for answering business questions. Data analysis, data analytics, business intelligence, and so on, are some of the ubiquitous terms floating around. This section is not going to delve into the discussion on the meaning, similarities or differences of these terms. On the other hand, the focus is going to be on how to bridge the gap of two major activities typically done by data scientists or data analysts. The first one being the data processing. The second one being the use of the processed data to do analysis with the help of charting and plotting. Data analysis is the forte of data analysts and data scientists. This book focuses on the usage of Spark and Python to process the data and produce charts and plots. In many data analysis use cases, a super set of data is processed and the reduced resultant dataset is used for the data analysis. This is specifically valid in the case of big data analysis, where a small set of processed data is used for analysis. Depending on the use case, for various data analysis needs an appropriate data processing is to be done as a prerequisite. Most of the use cases that are going to be covered in this book falls into this model where the first step deals with the necessary data processing and the second step deals with the charting and plotting required for the data analysis. In typical data analysis use cases, the chain of activities involves an extensive and multi staged Extract-Transform-Load (ETL) pipeline ending with a data analysis platform or application. The end result of this chain of activities include but not limited to tables of summary data and various visual representations of the data in the form of charts and plots. Since Spark can process data from heterogeneous distributed data sources very effectively, the huge ETL pipeline that existed in legacy data analysis applications can be consolidated into self contained applications that do the data processing and data analysis. 
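The distributed-versus-local distinction described above for SparkR has a direct counterpart in Python. The following sketch assumes a Spark 2 SparkSession named spark and that pandas is installed; the data is made up purely for illustration, and toPandas() requires the collected result to fit in the driver's memory.

import pandas as pd

local_df = pd.DataFrame({"city": ["Boston", "Denver"], "sales": [120, 95]})
spark_df = spark.createDataFrame(local_df)        # local -> distributed
summary = spark_df.groupBy("city").sum("sales")   # processed in parallel on the cluster
back_to_local = summary.toPandas()                # distributed -> local
print(back_to_local)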
Process data using Spark, analyze using Python Python is a programming language heavily used by the data analysts and data scientists these days. There are numerous scientific and statistical data processing libraries as well as charting and plotting libraries available that can be used in Python programs. It is also a widely used programming language to develop data processing applications in Spark. This brings in a great flexibility to have a unified data processing and data analysis framework with Spark, Python,and Python libraries, enabling to carry out scientific, and statistical processing, charting and plotting. There are numerous such libraries that work with Python. Out of all those, the NumPy and SciPylibraries are being used here to do numerical, statistical, and scientific data processing. The library matplotlib is being used here to carry out charting and plotting that produces 2D images. Processed data is used for data analysis. It requires deep understanding of the processed data. Charts and plots enhance the understanding of the characteristics of the underlying data. In essence, for a data analysis application, data processing, charting and plotting are essential. This book covers the usage of Spark with Python in conjunction with Python charting and plotting libraries for developing data analysis applications. Spark Streaming Data processing use cases can be mainly divided into two types. The first type is the use cases where the data is static and processing is done in its entirety as one unit of work or by dividing that into smaller batches. While doing the data processing, neither the underlying dataset changes nor new datasets get added to the processing units. This is batch processing. The second type is the use cases where the data is getting generated like a stream, and the processing is done as and when the data is generated. This is stream processing. Data sources generate data like a stream and many real-world use cases require them to be processed in a real-time fashion. The meaning of real-time can change from use case to use case. The main parameter that defines what is meant by realtime for a given use case is how soon the ingested data needs to be processed. Or the frequent interval in which all the data ingested since the last interval needs to be processed. For example, when a major sports event is happening, the application that consumes the score events and sending it to the subscribed users should be processing the data as fast as it can. The faster they can be sent, the better it is. But what is the definition of fast here? Is it fine to process the score data say after an hour of the score event happened? Probably not. Is it fine to process the data say after a minute of the score event happened? It is definitely better than processing after an hour. Is it fine to process the data say after a second of the score event happened? Probably yes, and much better than the earlier data processing time intervals. In any data stream processing use cases, this time interval is very important. The data processing framework should have the capability to process the data stream in an appropriate time interval of choice to deliver good business value. When processing stream data in regular intervals of choice, the data is collected from the beginning of the time interval to the end of the time interval, grouped them in a micro batch and data processing is done on that batch of data. 
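To make the micro-batch idea concrete, here is a minimal PySpark Streaming sketch. It assumes a SparkContext named sc and a hypothetical text stream arriving on a local socket; the 60-second batch interval and the 15-minute window anticipate the discussion that follows.

from pyspark.streaming import StreamingContext

ssc = StreamingContext(sc, 60)                      # one micro batch every 60 seconds
events = ssc.socketTextStream("localhost", 9999)

events.count().pprint()                             # events in the current micro batch
events.window(900, 60).count().pprint()             # events in the last 15 minutes

ssc.start()
ssc.awaitTermination()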
Over an extended period of time, the data processing application would have processed many such micro batches of data. In this type of processing, the data processing application will have visibility to only the specific micro batch that is getting processed at a given point in time. In other words, the application will not have any visibility or access to the already processed micro batches of data. Now, there is another dimension to this type of processing. Suppose a given use case mandates the need to process the data every minute, but at the same time, while processing the data of a given micro batch, there is a need to peek into the data that was already processed in the last 15 minutes. A fraud detection module of a retail banking transaction processing application is a good example of this particular business requirement. There is no doubt that the retail banking transactions are to be processed within milliseconds of its occurrence. When processing an ATM cash withdrawal transaction, it is a good idea to see whether somebody is trying to continuously withdraw cash in quick succession and if found, send proper alerting. For this, when processing a given cash withdrawal transaction, check whether there are any other cash withdrawals from the same ATM using the same card happened in the last 15 minutes. The business rule is to alert when such transactions are more than two in the last 15 minutes. In this use case, the fraud detection application should have the visibility to all the transactions happened in a window of 15 minutes. A good stream data processing framework should have the capability to process the data in any given interval of time with ability to peek into the data ingested within a sliding window of time. The Spark Streaming library that is working on top of Spark is one of the best data stream processing framework that has both of these capabilities. Spark machine learning Calculations based on formulae or algorithms were very common since ancient times to find the output for a given input. But without knowing the formulae or algorithms, computer scientists and mathematicians devised methods to generate formulae or algorithms based on an existing set of input, output dataset and predict the output of a new input data based on the generated formulae or algorithms. Generally, this process of 'learning' from a dataset and doing predictions based on the 'learning' is known as Machine Learning. It has its origin from the study of artificial intelligence in computer science. Practical machine learning has numerous applications that are being consumed by the laymen on a daily basis. YouTube users now get suggestions for the next items to be played in the playlist based on the video they are currently viewing. Popular movie rating sites are giving ratings and recommendations based on the user preferences. Social media websites, such as Facebook, suggest a list of names of the users' friends for easy tagging of pictures. What Facebook is doing here is that, it is classifying the pictures by name that is already available in the albums and checking whether the newly added picture has any similarity with the existing ones. If it finds a similarity, it suggests the name.The applications of this kind of picture identification are many. The way all these applications are working is based on the huge amount of input, output dataset that is already collected and the learning done based on that dataset. 
When a new input dataset comes, a prediction is made by making use of the 'learning' that the computer or machine already did. In traditional computing, input data is fed to a program to generate output. But in machine learning, input data and output data are fed to a machine learning algorithm to generate a function or program that can be used to predict the output of an input according to the 'learning' done on the input, output dataset fed to the machine learning algorithm. The data available in the wild may be classified into groups, or it may form clusters or may fit into certain relationships. These are different kinds of machine learning problems. For example, if there is a databank of preowned car sale prices with its associated attributes or features, it is possible to predict the fair price of a car just by knowing the associated attributes or features. Regression algorithms are used to solve these kinds of problems. If there is a databank of spam and non-spam e-mails, then when a new mail comes, it is possible to predict whether the new mail is a spam or non-spam.Classification algorithms are used to solve these kind of problems. These are just a few machine learning algorithm types. But in general, when using a bank of data, if there is a need to apply a machine learning algorithm and using that model predictions are to be done, then the data should be divided into features and outputs. So whichever may be the machine learning algorithm that is being used, there will be a set of features and one or more output(s). Many books and publications use the term label for output. In other words, features are the input and label is the output. Data comes in various shapes and forms. Depending on the machine learning algorithm used, the training data has to be preprocessed to have the features and labels in the right format to be fed to the machine learning algorithm. That in turn generates the appropriate hypothesis function, which takes the features as the input and produces the predicted label. Why Spark for machine learning? Spark Machine learning library uses many Spark core functionalities as well as the Spark libraries such as Spark SQL. The Spark machine learning library makes the machine learning application development easy by combining data processing and machine learning algorithm implementations in a unified framework with ability to do data processing on a cluster of nodes combined with ability to read and write data to a variety of data formats. Spark comes with two flavors of the machine learning library. They are spark.mllib and spark.ml. The first one is developed on top of Spark's RDD abstraction and the second one is developed on top of Spark's DataFrame abstraction. It is recommended to use the spark.ml library for any future machine learning application developments. Spark graph processing Graph is a mathematical concept and a data structure in computer science. It has huge applications in many real-world use cases. It is used to model pair-wise relationship between entities. The entity here is known as Vertex and two vertices are connected by an Edge. A graph comprises of a collection of vertices and edges connecting them. Conceptually, it is a deceptively simple abstraction but when it comes to processing a huge number of vertices and edges, it is computationally intensive and consumes lots of processing time and computing resources. There are numerous application constructs that can be modeled as graph. 
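Before looking at the graph examples that follow, here is a minimal spark.ml sketch of the features-and-label workflow described above. The SparkSession named spark, the CSV file, and the column names are all hypothetical; the point is only to show features going in and predicted labels coming out.

from pyspark.ml import Pipeline
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import LogisticRegression

data = spark.read.csv("data/emails.csv", header=True, inferSchema=True)

# Assemble the feature columns into a single vector column named 'features'
assembler = VectorAssembler(inputCols=["num_links", "num_caps"], outputCol="features")
lr = LogisticRegression(featuresCol="features", labelCol="label")

model = Pipeline(stages=[assembler, lr]).fit(data)   # learn from features and labels
predictions = model.transform(data)                  # adds a prediction column
predictions.select("features", "prediction").show()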
In a social networking application, the relationship between users can be modeled as a graph in which the users form the vertices of the graph and the the relationship between users form the edges of the graph. In a multistage job scheduling application, the individual tasks form the vertices of the graph and the sequencing of the tasks forms the edges. In a road traffic modeling system, the towns form the vertices of the graph and the roads connecting the towns form the edges. The edges of a given graph have a very important property, namely, the direction of the connection. In many use cases, the direction of connection doesn't matter. In the case of connectivity between the cities by roads is one such example. But if the use case is to produce driving directions within a city, the connectivity between traffic-junctions has a direction. Take any two traffic-junctions and there will be a road connectivity, but it is possible that it is a oneway. So, it depends on in which direction the traffic is flowing. If the road is open for traffic from traffic-junction J1 to J2 but closed from J2 to J1, then the graph of driving directions will have a connectivity from J1 to J2 and not from J2 to J1. In such cases, the edge connecting J1 and J2 has a direction. If the traffic between J2 and J3 are open in both ways, then the the edge connecting J2 and J3 has no direction. A graph with all the edges having direction is called a directed graph. For graph processing, so many libraries are available in the open source world itself. Giraph, Pregel, GraphLab, and Spark GraphX are some of them. The Spark GraphX is one of the recent entrants into this space. What is so special about Spark GraphX? It is a graph processing library built on top of the Spark data processing framework. Compared to the other graph processing libraries, Spark GraphX has a real advantage. It can make use of all the data processing capabilities of Spark. In reality, the performance of graph processing algorithms is not the only one aspect that needs consideration. In many of the applications, the data that needs to be modeled as graph does not exist in that form naturally. In many use cases more than the graph processing, lot of processor time and other computing resources are expended to get the data in the right format so that the graph processing algorithms can be applied. This is the sweet spot where the combination of Spark data processing framework and Spark GraphX library delivering its most value. The data processing jobs to make the data ready to be consumed by the Spark GraphX can be easily done using the plethora of tools available in the Spark toolkit. In summary, the Spark GraphX library, which is part of the Spark family combines the power of the core data processing capabilities of Spark and a very easy to use graph processing library. The biggest limitation of Spark GraphX library is that its API is not currently supported with programming languages such as Python and R. But there is an external Spark package named GraphFrames that solves this limitation. Since GraphFrames is a DataFrame-based library, once it is matured, it will enable the graph processing in all the programming languages supported by DataFrames. This Spark external package is definitely a potential candidate to be included as part of the Spark itself. Summary Any technology learned or taught has to be concluded with an application developed covering its salient features. Spark is no different. 
This book accomplishes an end-to-end application developed using the Lambda Architecture, with Spark as the data processing platform and its family of libraries built on top of it. Resources for Article: Further resources on this subject: Setting up Spark [article] Machine Learning Using Spark MLlib [article] Holistic View on Spark [article]


Basics of Image Histograms in OpenCV

Packt
12 Oct 2016
11 min read
In this article by Samyak Datta, author of the book Learning OpenCV 3 Application Development we are going to focus our attention on a different style of processing pixel values. The output of the techniques, which would comprise our study in the current article, will not be images, but other forms of representation for images, namely image histograms. We have seen that a two-dimensional grid of intensity values is one of the default forms of representing images in digital systems for processing as well as storage. However, such representations are not at all easy to scale. So, for an image with a reasonably low spatial resolution, say 512 x 512 pixels, working with a two-dimensional grid might not pose any serious issues. However, as the dimensions increase, the corresponding increase in the size of the grid may start to adversely affect the performance of the algorithms that work with the images. A primary advantage that an image histogram has to offer is that the size of a histogram is a constant that is independent of the dimensions of the image. As a consequence of this, we are guaranteed that irrespective of the spatial resolution of the images that we are dealing with, the algorithms that power our solutions will have to deal with a constant amount of data if they are working with image histograms. (For more resources related to this topic, see here.) Each descriptor captures some particular aspects or features of the image to construct its own form of representation. One of the common pitfalls of using histograms as a form of image representation as compared to its native form of using the entire two-dimensional grid of values is loss of information. A full-fledged image representation using pixel intensity values for all pixel locations naturally consists of all the information that you would need to reconstruct a digital image. However, the same cannot be said about histograms. When we study about image histograms in detail, we'll get to see exactly what information do we stand to lose. And this loss in information is prevalent across all forms of image descriptors. The basics of histograms At the outset, we will briefly explain the concept of a histogram. Most of you might already know this from your lessons on basic statistics. However, we will reiterate this for the sake of completeness. Histogram is a form of data representation technique that relies on an aggregation of data points. The data is aggregated into a set of predefined bins that are represented along the x axis, and the number of data points that fall within each of the bins make up the corresponding counts on the y axis. For example, let's assume that our data looks something like the following: D={2,7,1,5,6,9,14,11,8,10,13} If we define three bins, namely Bin_1 (1 - 5), Bin_2 (6 - 10), and Bin_3 (11 - 15), then the histogram corresponding to our data would look something like this: Bins Frequency Bin_1 (1 - 5) 3 Bin_2 (6 - 10) 5 Bin_3 (11 - 15) 3 What this histogram data tells us is that we have three values between 1 and 5, five between 6 and 10, and three again between 11 and 15. Note that it doesn't tell us what the values are, just that some n values exist in a given bin. A more familiar visual representation of the histogram in discussion is shown as follows: As you can see, the bins have been plotted along the x axis and their corresponding frequencies along the y axis. Now, in the context of images, how is a histogram computed? Well, it's not that difficult to deduce. 
Since the data that we have comprise pixel intensity values, an image histogram is computed by plotting a histogram using the intensity values of all its constituent pixels. What this essentially means is that the sequence of pixel intensity values in our image becomes the data. Well, this is in fact the simplest kind of histogram that you can compute using the information available to you from the image. Now, coming back to image histograms, there are some basic terminologies (pertaining to histograms in general) that you need to be aware of before you can dip your hands into code. We have explained them in detail here: Histogram size: The histogram size refers to the number of bins in the histogram. Range: The range of a histogram is the range of data that we are dealing with. The range of data as well as the histogram size are both important parameters that define a histogram. Dimensions: Simply put, dimensions refer to the number of the type of items whose values we aggregate in the histogram bins. For example, consider a grayscale image. We might want to construct a histogram using the pixel intensity values for such an image. This would be an example of a single-dimensional histogram because we are just interested in aggregating the pixel intensity values and nothing else. The data, in this case, is spread over a range of 0 to 255. On account of being one-dimensional, such histograms can be represented graphically as 2D plots—one-dimensional data (pixel intensity values) being plotted on the x axis (in the form of bins) along with the corresponding frequency counts along the y axis. We have already seen an example of this before. Now, imagine a color image with three channels: red, green, and blue. Let's say that we want to plot a histogram for the intensities in the red and green channels combined. This means that our data now becomes a pair of values (r, g). A histogram that is plotted for such data will have a dimensionality of 2. The plot for such a histogram will be a 3D plot with the data bins covering the x and y axes and the frequency counts plotted along the z axis. Now that we have discussed the theoretical aspects of image histograms in detail, let's start thinking along the lines of code. We will start with the simplest (and in fact the most ubiquitous) design of image histograms. The range of our data will be from 0 to 255 (both inclusive), which means that all our data points will be integers that fall within the specified range. Also, the number of data points will equal the number of pixels that make up our input image. The simplicity in design comes from the fact that we fix the size of the histogram (the number of bins) as 256. Now, take a moment to think about what this means. There are 256 different possible values that our data points can take and we have a separate bin corresponding to each one of those values. So such an image histogram will essentially depict the 256 possible intensity values along with the counts of the number of pixels in the image that are colored with each of the different intensities. Before taking a peek at what OpenCV has to offer, let's try to implement such a histogram on our own! We define a function named computeHistogram() that takes the grayscale image as an input argument and returns the image histogram. From our earlier discussions, it is evident that the histogram must contain 256 entries (for the 256 bins): one for each integer between 0 and 255. 
The value stored in the histogram corresponding to each of the 256 entries will be the count of the image pixels that have a particular intensity value. So, conceptually, we can use an array for our implementation such that the value stored in the histogram [ i ] (for 0≤i≤255) will be the count of the number of pixels in the image having the intensity of i. However, instead of using a C++ array, we will comply with the rules and standards followed by OpenCV and represent the histogram as a Mat object. We have already seen that a Mat object is nothing but a multidimensional array store. The implementation is outlined in the following code snippet: Mat computeHistogram(Mat input_image) { Mat histogram = Mat::zeros(256, 1, CV_32S); for (int i = 0; i < input_image.rows; ++i) { for (int j = 0; j < input_image.cols; ++j) { int binIdx = (int) input_image.at<uchar>(i, j); histogram.at<int>(binIdx, 0) += 1; } } return histogram; } As you can see, we have chosen to represent the histogram as a 256-element-column-vector Mat object. We iterate over all the pixels in the input image and keep on incrementing the corresponding counts in the histogram (which had been initialized to 0). As per our description of the image histogram properties, it is easy to see that the intensity value of any pixel is the same as the bin index that is used to index into the appropriate histogram bin to increment the count. Having such an implementation ready, let's test it out with the help of an actual image. The following code demonstrates a main() function that reads an input image, calls the computeHistogram() function that we have defined just now, and displays the contents of the histogram that is returned as a result: int main() { Mat input_image = imread("/home/samyak/Pictures/lena.jpg", IMREAD_GRAYSCALE); Mat histogram = computeHistogram(input_image); cout << "Histogram...n"; for (int i = 0; i < histogram.rows; ++i) cout << i << " : " << histogram.at<int>(i, 0) << "n"; return 0; } We have used the fact that the histogram that is returned from the function will be a single column Mat object. This makes the code that displays the contents of the histogram much cleaner. Histograms in OpenCV We have just seen the implementation of a very basic and minimalistic histogram using the first principles in OpenCV. The image histogram was basic in the sense that all the bins were uniform in size and comprised only a single pixel intensity. This made our lives simple when we designed our code for the implementation; there wasn't any need to explicitly check the membership of a data point (the intensity value of a pixel) with all the bins of our histograms. However, we know that a histogram can have bins whose sizes span more than one. Can you think of the changes that we might need to make in the code that we had written just now to accommodate for bin sizes larger than 1? If this change seems doable to you, try to figure out how to incorporate the possibility of non-uniform bin sizes or multidimensional histograms. By now, things might have started to get a little overwhelming to you. No need to worry. As always, OpenCV has you covered! The developers at OpenCV have provided you with a calcHist() function whose sole purpose is to calculate the histograms for a given set of arrays. 
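As an aside, the same function is exposed through OpenCV's Python bindings as cv2.calcHist. A minimal sketch of the two-dimensional red-green histogram discussed earlier looks like this (the image path is hypothetical, and remember that OpenCV loads colour images in BGR channel order); the C++ call and the meaning of each argument are covered next.

import cv2

image = cv2.imread("lena.jpg")                       # BGR image
hist_2d = cv2.calcHist([image], [2, 1], None,        # channels 2 and 1: red and green
                       [256, 256], [0, 256, 0, 256])
print(hist_2d.shape)                                 # (256, 256): one bin per (red, green) pair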
By arrays, we refer to the images represented as Mat objects, and we use the term set because the function has the capability to compute multidimensional histograms from the given data: Mat computeHistogram(Mat input_image) { Mat histogram; int channels[] = { 0 }; int histSize[] = { 256 }; float range[] = { 0, 256 }; const float* ranges[] = { range }; calcHist(&input_image, 1, channels, Mat(), histogram, 1, histSize, ranges, true, false); return histogram; } Before we move on to an explanation of the different parameters involved in the calcHist() function call, I want to bring your attention to the abundant use of arrays in the preceding code snippet. Even arguments as simple as histogram sizes are passed to the function in the form of arrays rather than integer values, which at first glance seem quite unnecessary and counter-intuitive. The usage of arrays is due to the fact that the implementation of calcHist() is equipped to handle multidimensional histograms as well, and when we are dealing with such multidimensional histogram data, we require multiple parameters to be passed, one for each dimension. This would become clearer once we demonstrate an example of calculating multidimensional histograms using the calcHist() function. For the time being, we just wanted to clear the immediate confusion that might have popped up in your minds upon seeing the array parameters. Here is a detailed list of the arguments in the calcHist() function call: Source images Number of source images Channel indices Mask Dimensions (dims) Histogram size Ranges Uniform flag Accumulate flag The last couple of arguments (the uniform and accumulate flags) have default values of true and false, respectively. Hence, the function call that you have seen just now can very well be written as follows: calcHist(&input_image, 1, channels, Mat(), histogram, 1, histSize, ranges); Summary Thus in this article we have successfully studied fundamentals of using histograms in OpenCV for image processing. Resources for Article: Further resources on this subject: Remote Sensing and Histogram [article] OpenCV: Image Processing using Morphological Filters [article] Learn computer vision applications in Open CV [article]

Solving an NLP Problem with Keras, Part 1

Sasank Chilamkurthy
12 Oct 2016
5 min read
In a previous two-part post series on Keras, I introduced Convolutional Neural Networks (CNNs) and the Keras deep learning framework. We used them to solve a Computer Vision (CV) problem involving traffic sign recognition. Now, in this two-part post series, we will solve a Natural Language Processing (NLP) problem with Keras. Let's begin.

The Problem and the Dataset

The problem we are going to tackle is Natural Language Understanding: the aim is to extract the meaning of speech utterances. This is still an unsolved problem, so we break it down into the solvable, practical problem of understanding the speaker in a limited context. In particular, we want to identify the intent of a speaker asking for information about flights.

The dataset we are going to use is the Airline Travel Information System (ATIS) dataset. It was collected by DARPA in the early 90s and consists of spoken queries on flight-related information. An example utterance is I want to go from Boston to Atlanta on Monday. Understanding this is then reduced to identifying arguments like Destination and Departure Day. This task is called slot-filling. Here is an example sentence and its labels. You will observe that labels are encoded in an Inside Outside Beginning (IOB) representation:

|Words | Show | flights | from | Boston | to | New | York| today|
|Labels| O | O | O |B-dept | O|B-arr|I-arr|B-date|

The ATIS official split contains 4,978/893 sentences for a total of 56,590/9,198 words (average sentence length is 15) in the train/test set. The number of classes (different slots) is 128, including the O label (NULL). Unseen words in the test set are encoded by the <UNK> token, and each digit is replaced with the string DIGIT; that is, 20 is converted to DIGITDIGIT.

Our approach to the problem is to use:

Word embeddings
Recurrent neural networks

I'll talk about these briefly in the following sections.

Word Embeddings

Word embeddings map words to vectors in a high-dimensional space. These word embeddings can actually learn the semantic and syntactic information of words; for instance, similar words are close to each other in this space and dissimilar words are far apart. Embeddings can be learned either from large amounts of text like Wikipedia, or specifically for a given problem. We will take the second approach for this problem. As an illustration, I have shown here the nearest neighbors in the word embedding space for some of the words. Each column lists the neighbors of the word in its header, and this embedding space was learned by the model that we'll define later in the post:

sunday      delta        california    boston       august     time        car
wednesday   continental  colorado      nashville    september  schedule    rental
saturday    united       florida       toronto      july       times       limousine
friday      american     ohio          chicago      june       schedules   rentals
monday      eastern      georgia       phoenix      december   dinnertime  cars
tuesday     northwest    pennsylvania  cleveland    november   ord         taxi
thursday    us           north         atlanta      april      f28         train
wednesdays  nationair    tennessee     milwaukee    october    limo        limo
saturdays   lufthansa    minnesota     columbus     january    departure   ap
sundays     midwest      michigan      minneapolis  may        sfo         later

Recurrent Neural Networks

Convolutional layers can be a great way to pool local information, but they do not really capture the sequentiality of data. Recurrent Neural Networks (RNNs) help us tackle sequential information like natural language. If we are going to predict properties of the current word, we had better remember the words before it too.
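Before moving on to RNNs, here is a sketch of how a nearest-neighbour table like the one above can be read off a learned embedding matrix. All the names here (embeddings, w2idx, idx2w) are hypothetical placeholders; in Keras, such a matrix could be pulled out of a trained Embedding layer with something like model.layers[0].get_weights()[0].

import numpy as np

def nearest_words(word, embeddings, w2idx, idx2w, k=5):
    # Cosine similarity between the query word vector and every other word vector
    v = embeddings[w2idx[word]]
    norms = np.linalg.norm(embeddings, axis=1) * np.linalg.norm(v) + 1e-8
    sims = embeddings.dot(v) / norms
    return [idx2w[i] for i in np.argsort(-sims)[1:k + 1]]   # skip the word itself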
An RNN has an internal state (memory) that stores a summary of the sequence it has seen so far. This allows us to use RNNs to solve complicated word tagging problems such as Part Of Speech (POS) tagging or slot filling, as in our case. The following diagram illustrates the internals of an RNN (source: Nature). Let's briefly go through it:

x_1, x_2, ..., x_(t-1), x_t, x_(t+1), ... is the input to the RNN.
s_t is the hidden state of the RNN at step t. It is computed from the state at step t-1 as s_t = f(U*x_t + W*s_(t-1)), where f is a nonlinearity such as tanh or ReLU.
o_t is the output at step t, computed as o_t = f(V*s_t).
U, V, and W are the learnable parameters of the RNN.

For our problem, we will pass a sequence of word embeddings as the input to the RNN.

Putting it all together

Now that we've set up the problem and have an understanding of the basic building blocks, let's code it up. Since we are using the IOB representation for labels, it is not simple to calculate the scores of our model. We therefore use the conlleval perl script to compute the F1 scores. I've adapted the code from here for the data preprocessing and score calculation. The complete code is available at GitHub:

$ git clone https://github.com/chsasank/ATIS.keras.git
$ cd ATIS.keras

I recommend using jupyter notebook to run and experiment with the snippets from the tutorial.

$ jupyter notebook

Conclusion

In part 2, we will load the data using data.load.atisfull(). We will also define the Keras model, and then we will train the model. To measure the accuracy of the model, we'll use model.predict_on_batch() and metrics.accuracy.conlleval(). And finally, we will improve our model to achieve better results.

About the author

Sasank Chilamkurthy works at Fractal Analytics. His work involves deep learning on medical images obtained from radiology and pathology. He is mainly interested in computer vision.


Thinking Probabilistically

Packt
04 Oct 2016
16 min read
In this article by Osvaldo Martin, the author of the book Bayesian Analysis with Python, we will learn that Bayesian statistics has been developing for more than 250 years now. During this time, it has enjoyed as much recognition and appreciation as disdain and contempt. In the last few decades, it has gained an increasing amount of attention from people in the field of statistics and almost all the other sciences, engineering, and even outside the walls of the academic world. This revival has been possible due to theoretical and computational developments; modern Bayesian statistics is mostly computational statistics. The necessity for flexible and transparent models and more intuitive interpretation of the results of a statistical analysis has only contributed to the trend. (For more resources related to this topic, see here.) Here, we will adopt a pragmatic approach to Bayesian statistics and we will not care too much about other statistical paradigms and their relationship with Bayesian statistics. The aim of this book is to learn how to do Bayesian statistics with Python; philosophical discussions are interesting but they have already been discussed elsewhere in a much richer way than we could discuss in these pages. We will use a computational and modeling approach, and we will learn to think in terms of probabilistic models and apply Bayes' theorem to derive the logical consequences of our models and data. Models will be coded using Python and PyMC3, a great library for Bayesian statistics that hides most of the mathematical details of Bayesian analysis from the user. Bayesian statistics is theoretically grounded in probability theory, and hence it is no wonder that many books about Bayesian statistics are full of mathematical formulas requiring a certain level of mathematical sophistication. Nevertheless, programming allows us to learn and do Bayesian statistics with only modest mathematical knowledge. This is not to say that learning the mathematical foundations of statistics is useless; don't get me wrong, that could certainly help you build better models and gain an understanding of problems, models, and results. In this article, we will cover the following topics: Statistical modeling Probabilities and uncertainty Statistical modeling Statistics is about collecting, organizing, analyzing, and interpreting data, and hence statistical knowledge is essential for data analysis. Another useful skill when analyzing data is knowing how to write code in a programming language such as Python. Manipulating data is usually necessary given that we live in a messy world with even more messy data, and coding helps to get things done. Even if your data is clean and tidy, programming will still be very useful since, as will see, modern Bayesian statistics is mostly computational statistics. Most introductory statistical courses, at least for non-statisticians, are taught as a collection of recipes that more or less go like this; go to the the statistical pantry, pick one can and open it, add data to taste and stir until obtaining a consisting p-value, preferably under 0.05 (if you don't know what a p-value is, don't worry; we will not use them in this book). The main goal in this type of course is to teach you how to pick the proper can. We will take a different approach: we will also learn some recipes, but this will be home-made food rather than canned food; we will learn hot to mix fresh ingredients that will suit different gastronomic occasions. 
But before we can cook, we must learn some statistical vocabulary and also some concepts.

Exploratory data analysis

Data is an essential ingredient of statistics. Data comes from several sources, such as experiments, computer simulations, surveys, field observations, and so on. If we are the ones who will be generating or gathering the data, it is always a good idea to first think carefully about the questions we want to answer and which methods we will use, and only then proceed to get the data. In fact, there is a whole branch of statistics dealing with data collection known as experimental design. In the era of data deluge, we can sometimes forget that getting data is not always cheap. For example, while it is true that the Large Hadron Collider (LHC) produces hundreds of terabytes a day, its construction took years of manual and intellectual effort. In this book we will assume that we have already collected the data and also that the data is clean and tidy, something rarely true in the real world. We will make these assumptions in order to focus on the subject of this book. If you want to learn how to use Python for cleaning and manipulating data, and also want a primer on statistics and machine learning, you should probably read Python Data Science Handbook by Jake VanderPlas.

OK, so let's assume we have our dataset; usually, a good idea is to explore and visualize it in order to get some idea of what we have in our hands. This can be achieved through what is known as Exploratory Data Analysis (EDA), which basically consists of the following:

Descriptive statistics
Data visualization

The first one, descriptive statistics, is about how to use some measures (or statistics) to summarize or characterize the data in a quantitative manner. You probably already know that you can describe data using the mean, mode, standard deviation, interquartile ranges, and so forth. The second one, data visualization, is about visually inspecting the data; you are probably familiar with representations such as histograms, scatter plots, and others. While EDA was originally thought of as something you apply to data before doing any complex analysis, or even as an alternative to complex model-based analysis, throughout the book we will learn that EDA is also applicable to understanding, interpreting, checking, summarizing, and communicating the results of Bayesian analysis.

Inferential statistics

Sometimes, plotting our data and computing simple numbers, such as the average of our data, is all we need. Other times, we want to go beyond our data to understand the underlying mechanism that could have generated the data, or maybe we want to make predictions for future data, or we need to choose among several competing explanations for the same data. That's the job of inferential statistics. To do inferential statistics, we will rely on probabilistic models. There are many types of models, and most of science, and, I will add, all of our understanding of the real world, is built through models. The brain is just a machine that models reality (whatever reality might be); see http://www.tedxriodelaplata.org/videos/m%C3%A1quina-construye-realidad.

What are models?

Models are simplified descriptions of a given system (or process).
Those descriptions are purposely designed to capture only the most relevant aspects of the system, and hence most models do not pretend they are able to explain everything; on the contrary, if we have a simple and a complex model and both models explain the data well, we will generally prefer the simpler one. Model building, no matter which type of model you are building, is an iterative process following more or less the same basic rules. We can summarize the Bayesian modeling process using three steps:

Given some data and some assumptions on how this data could have been generated, we will build models. Most of the time, models will be crude approximations, but most of the time this is all we need.
Then we will use Bayes' theorem to add data to our models and derive the logical consequences of mixing the data and our assumptions. We say we are conditioning the model on our data.
Lastly, we will check that the model makes sense according to different criteria, including our data and our expertise on the subject we are studying.

In general, we will find ourselves performing these three steps in a non-linear, iterative fashion. Sometimes we will retrace our steps at any given point: maybe we made a silly programming mistake, maybe we found a way to change the model and improve it, maybe we need to add more data. Bayesian models are also known as probabilistic models because they are built using probabilities. Why probabilities? Because probabilities are the correct mathematical tool for dealing with uncertainty in our data and models, so let's take a walk through the garden of forking paths.

Probabilities and uncertainty

While probability theory is a mature and well-established branch of mathematics, there is more than one interpretation of what probabilities are. To a Bayesian, a probability is a measure that quantifies the uncertainty level of a statement. If we know nothing about coins and we do not have any data about coin tosses, it is reasonable to think that the probability of a coin landing heads could take any value between 0 and 1; that is, in the absence of information, all values are equally likely and our uncertainty is maximum. If we know instead that coins tend to be balanced, then we may say that the probability of a coin landing heads is exactly 0.5, or maybe around 0.5 if we admit that the balance is not perfect. If we collect data, we can update these prior assumptions and hopefully reduce the uncertainty about the bias of the coin. Under this definition of probability, it is totally valid and natural to ask about the probability of life on Mars, the probability of the mass of the electron being 9.1 x 10^-31 kg, or the probability of the 9th of July of 1816 being a sunny day. Notice, for example, that life on Mars either exists or it does not; it is a binary outcome, but what we are really asking is how likely it is to find life on Mars given our data and what we know about biology and the physical conditions on that planet. The statement is about our state of knowledge and not, directly, about a property of nature. We are using probabilities because we cannot be sure about the events, not because the events are necessarily random. Since this definition of probability is about our epistemic state of mind, sometimes it is referred to as the subjective definition of probability, explaining the slogan of subjective statistics often attached to the Bayesian paradigm.
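To make the coin example concrete, here is a small sketch (not part of the original article) that updates a prior belief about the bias of a coin after observing some tosses. It uses the analytical Beta-Binomial conjugate update with SciPy; the numbers of tosses and heads are made-up values chosen only for illustration, and PyMC3 will later automate this kind of update for models where no analytical shortcut exists.

import numpy as np
from scipy import stats

# Made-up data: 30 tosses, 21 heads (illustrative values only)
n_tosses, n_heads = 30, 21

# Prior: Beta(1, 1), i.e. every bias value between 0 and 1 is equally likely
prior = stats.beta(1, 1)
# Posterior after seeing the data: Beta(1 + heads, 1 + tails), the conjugate update
posterior = stats.beta(1 + n_heads, 1 + (n_tosses - n_heads))

grid = np.linspace(0, 1, 5)
print("bias grid      :", grid)
print("prior density  :", prior.pdf(grid))       # flat: maximum uncertainty
print("posterior mean :", posterior.mean())      # pulled toward 21/30 = 0.7
print("95% interval   :", posterior.interval(0.95))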
Nevertheless, this definition does not mean all statements should be treated as equally valid and so anything goes; this definition is about acknowledging that our understanding of the world is imperfect and conditioned by the data and models we have made. There is no such thing as a model-free or theory-free understanding of the world; even if it were possible to free ourselves from our social preconditioning, we would end up with a biological limitation: our brain, subject to the evolutionary process, has been wired with models of the world. We are doomed to think like humans and we will never think like bats or anything else! Moreover, the universe is an uncertain place and all we can do is make probabilistic statements about it. Notice that it does not matter if the underlying reality of the world is deterministic or stochastic; we are using probability as a tool to quantify uncertainty.

Logic is about thinking without making mistakes. In Aristotelian or classical logic, we can only have statements that are true or false. In the Bayesian definition of probability, certainty is just a special case: a true statement has a probability of 1, a false one has a probability of 0. We would assign a probability of 1 to the existence of life on Mars only after having conclusive data indicating something is growing and reproducing and doing other activities we associate with living organisms. Notice, however, that assigning a probability of 0 is harder because we can always think that there is some Martian spot that is unexplored, or that we have made mistakes with some experiment, or several other reasons that could lead us to falsely believe life is absent on Mars when it is not. Interestingly enough, Cox mathematically proved that if we want to extend logic to contemplate uncertainty, we must use probabilities and probability theory, from which Bayes' theorem is just a logical consequence, as we will see soon. Hence, another way of thinking about Bayesian statistics is as an extension of logic for dealing with uncertainty, something that clearly has nothing to do with subjective reasoning in the pejorative sense. Now that we know the Bayesian interpretation of probability, let's see some of the mathematical properties of probabilities. For a more detailed study of probability theory, you can read Introduction to Probability by Joseph K. Blitzstein and Jessica Hwang.

Probabilities are numbers in the interval [0, 1], that is, numbers between 0 and 1, including both extremes. Probabilities follow some rules; one of these rules is the product rule:

p(A, B) = p(A|B) p(B)

We read this as follows: the probability of A and B is equal to the probability of A given B, multiplied by the probability of B. The expression p(A|B) is used to indicate a conditional probability; the name refers to the fact that the probability of A is conditioned by knowing B. For example, the probability that a pavement is wet is different from the probability that the pavement is wet given that we know it is raining. A conditional probability can be larger than, smaller than, or equal to the unconditioned probability. If knowing B does not provide us with information about A, then p(A|B) = p(A). That is, A and B are independent of each other. On the contrary, if knowing B gives us useful information about A, then p(A|B) will differ from p(A); it may be larger (if B makes A more plausible) or smaller (if B makes A less plausible). Conditional probabilities are a key concept in statistics, and understanding them is crucial to understanding Bayes' theorem, as we will see soon. Let's try to understand them from a different perspective.
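The product rule and the behaviour of conditional probabilities are easy to verify numerically. The following short sketch (not part of the original article) builds a small, made-up joint distribution for two binary events, rain and wet pavement, checks that p(A, B) = p(A|B) p(B), and shows that conditioning can move the probability of A either up or down.

# Made-up joint probabilities for illustration: A = "pavement is wet", B = "it is raining"
p_joint = {
    (True, True): 0.27,   # wet and raining
    (True, False): 0.09,  # wet and not raining (e.g. a sprinkler)
    (False, True): 0.03,  # dry and raining (covered pavement)
    (False, False): 0.61, # dry and not raining
}

p_B = sum(p for (a, b), p in p_joint.items() if b)    # p(raining)
p_A = sum(p for (a, b), p in p_joint.items() if a)    # p(wet)
p_A_given_B = p_joint[(True, True)] / p_B             # p(wet | raining)
p_A_given_notB = p_joint[(True, False)] / (1 - p_B)   # p(wet | not raining)

# Product rule: p(A, B) = p(A|B) * p(B)
assert abs(p_joint[(True, True)] - p_A_given_B * p_B) < 1e-12

print(f"p(wet) = {p_A:.2f}")
print(f"p(wet | raining) = {p_A_given_B:.2f}       # larger than p(wet)")
print(f"p(wet | not raining) = {p_A_given_notB:.2f}   # smaller than p(wet)")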
If we reorder the equation for the product rule, we get the following:

p(A|B) = p(A, B) / p(B)

Hence, p(A|B) is the probability that both A and B happen, relative to the probability of B happening. Why do we divide by p(B)? Knowing B is equivalent to saying that we have restricted the space of possible events to B, and thus, to find the conditional probability, we take the favorable cases and divide them by the total number of events. It is important to realize that all probabilities are indeed conditional; there is no such thing as an absolute probability floating in a vacuum. There is always some model, assumption, or condition, even if we don't notice or know them. The probability of rain is not the same if we are talking about Earth, Mars, or some other place in the Universe, the same way the probability of a coin landing heads or tails depends on our assumptions about the coin being biased in one way or another. Now that we are more familiar with the concept of probability, let's jump to the next topic, probability distributions.

Probability distributions

A probability distribution is a mathematical object that describes how likely different events are. In general, these events are restricted somehow to a set of possible events. A common and useful conceptualization in statistics is to think that data was generated from some probability distribution with unobserved parameters. Since the parameters are unobserved and we only have data, we will use Bayes' theorem to invert the relationship, that is, to go from the data to the parameters. Probability distributions are the building blocks of Bayesian models; by combining them in proper ways we can get useful complex models. We will meet several probability distributions throughout the book; every time we discover one we will take a moment to try to understand it. Probably the most famous of all of them is the Gaussian or normal distribution. A variable x follows a Gaussian distribution if its values are dictated by the following formula:

p(x | μ, σ) = 1 / (σ √(2π)) * exp(-(x - μ)² / (2σ²))

In the formula, μ and σ are the parameters of the distribution. The first one, μ, can take any real value and dictates the mean of the distribution (and also the median and mode, which are all equal). The second one, σ, is the standard deviation, which can only be positive and dictates the spread of the distribution. Since there are an infinite number of possible combinations of μ and σ values, there is an infinite number of instances of the Gaussian distribution and all of them belong to the same Gaussian family. Mathematical formulas are concise and unambiguous, and some people say even beautiful, but we must admit that meeting them can be intimidating; a good way to break the ice is to use Python to explore them. Let's see what the Gaussian distribution family looks like:

import matplotlib.pyplot as plt
import numpy as np
from scipy import stats
import seaborn as sns

mu_params = [-1, 0, 1]
sd_params = [0.5, 1, 1.5]
x = np.linspace(-7, 7, 100)
f, ax = plt.subplots(len(mu_params), len(sd_params), sharex=True, sharey=True)
for i in range(3):
    for j in range(3):
        mu = mu_params[i]
        sd = sd_params[j]
        y = stats.norm(mu, sd).pdf(x)
        ax[i,j].plot(x, y)
        ax[i,j].set_ylim(0, 1)
        ax[i,j].plot(0, 0, label="$\mu$ = {:3.2f}\n$\sigma$ = {:3.2f}".format(mu, sd), alpha=0)
        ax[i,j].legend()
ax[2,1].set_xlabel('$x$')
ax[1,0].set_ylabel('$pdf(x)$')

The output of the preceding code is a 3 x 3 grid of Gaussian density curves, one panel per combination of μ and σ. A variable, such as x, that comes from a probability distribution is called a random variable. This does not mean that the variable can take any possible value.
On the contrary, the values are strictly dictated by the probability distribution; the randomness arises from the fact that we cannot predict which value the variable will take, only the probability of observing those values. A common notation used to say that a variable is distributed as a Gaussian or normal distribution with parameters μ and σ is as follows:

x ~ N(μ, σ)

The symbol ~ is read as "is distributed as". There are two types of random variables, continuous and discrete. Continuous variables can take any value from some interval (we can use Python floats to represent them), and discrete variables can take only certain values (we can use Python integers to represent them). Many models assume that successive values of a random variable are all sampled from the same distribution and that those values are independent of each other. In such a case, we say that the variables are independently and identically distributed, or iid variables for short. Using mathematical notation, we can see that two variables are independent if, for every value of x and y:

p(x, y) = p(x) p(y)

A common example of non-iid variables are time series, where the temporal dependency in the random variable is a key feature that should be taken into account.

Summary

In this article we took a practical approach to Bayesian statistics and saw how to start implementing Bayesian statistics with Python. We learned to think about problems in terms of probability and uncertainty and to apply Bayes' theorem to derive the logical consequences of our models and data.

Resources for Article:

Further resources on this subject:

Python Data Science Up and Running [article]
Mining Twitter with Python – Influence and Engagement [article]
Exception Handling in MySQL for Python [article]

Supervised Machine Learning

Packt
04 Oct 2016
13 min read
In this article by Anshul Joshi, the author of the book Julia for Data Science, we will learn that data science involves understanding data, gathering data, munging data, taking the meaning out of that data, and then machine learning if needed. Julia is a high-level, high-performance dynamic programming language for technical computing, with syntax that is familiar to users of other technical computing environments. (For more resources related to this topic, see here.) The key features offered by Julia are: A general purpose high-level dynamic programming language designed to be effective for numerical and scientific computing A Low-Level Virtual Machine (LLVM) based Just-in-Time (JIT) compiler that enables Julia to approach the performance of statically-compiled languages like C/C++ What is machine learning? Generally, when we talk about machine learning, we get into the idea of us fighting wars with intelligent machines that we created but went out of control. These machines are able to outsmart the human race and become a threat to human existence. These theories are nothing but created for our entertainment. We are still very far away from such machines. So, the question is: what is machine learning? Tom M. Mitchell gave a formal definition- "A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P if its performance at tasks in T, as measured by P, improves with experience E." It says that machine learning is teaching computers to generate algorithms using data without programming them explicitly. It transforms data into actionable knowledge. Machine learning has close association with statistics, probability, and mathematical optimization. As technology grew, there is one thing that grew with it exponentially—data. We have huge amounts of unstructured and structured data growing at a very great pace. Lots of data is generated by space observatories, meteorologists, biologists, fitness sensors, surveys, and so on. It is not possible to manually go through this much amount of data and find patterns or gain insights. This data is very important for scientists, domain experts, governments, health officials, and even businesses. To gain knowledge out of this data, we need self-learning algorithms that can help us in decision making. Machine learning evolved as a subfield of artificial intelligence, which eliminates the need to manually analyze large amounts of data. Instead of using machine learning, we make data-driven decisions by gaining knowledge using self-learning predictive models. Machine learning has become important in our daily lives. Some common use cases include search engines, games, spam filters, and image recognition. Self-driving cars also use machine learning. Some basic terminologies used in machine learning: Features: Distinctive characteristics of the data point or record Training set: This is the dataset that we feed to train the algorithm that helps us to find relationships or build a model Testing set: The algorithm generated using the training dataset is tested on the testing dataset to find the accuracy Feature vector: An n-dimensional vector that contains the features defining an object Sample: An item from the dataset or the record Uses of machine learning Machine learning in one way or another is used everywhere. Its applications are endless. 
Let's discuss some very common use cases: E-mail spam filtering: Every major e-mail service provider uses machine learning to filter out spam messages from the Inbox to the Spam folder. Predicting storms and natural disasters: Machine learning is used by meteorologists and geologists to predict the natural disasters using weather data, which can help us to take preventive measures. Targeted promotions/campaigns and advertising: On social sites, search engines, and maybe in mailboxes, we see advertisements that somehow suit our taste. This is made feasible using machine learning on the data from our past searches, our social profile or the e-mail contents. Self-driving cars: Technology giants are currently working on self driving cars. This is made possible using machine learning on the feed of the actual data from human drivers, image and sound processing, and various other factors. Machine learning is also used by businesses to predict the market. It can also be used to predict the outcomes of elections and the sentiment of voters towards a particular candidate. Machine learning is also being used to prevent crime. By understanding the pattern of the different criminals, we can predict a crime that can happen in future and can prevent it. One case that got a huge amount of attention was of a big retail chain in the United States using machine learning to identify pregnant women. The retailer thought of the strategy to give discounts on multiple maternity products, so that they would become loyal customers and will purchase items for babies which have a high profit margin. The retailer worked on the algorithm to predict the pregnancy using useful patterns in purchases of different products which are useful for pregnant women. Once a man approached the retailer and asked for the reason that his teenage daughter is receiving discount coupons for maternity items. The retail chain offered an apology but later the father himself apologized when he got to know that his daughter was indeed pregnant. This story may or may not be completely true, but retailers indeed analyze their customers' data routinely to find out patterns and for targeted promotions, campaigns, and inventory management. Machine learning and ethics Let's see where machine learning is used very frequently: Retailers: In the previous example, we mentioned how retail chains use data for machine learning to increase their revenue as well as to retain their customers. Spam filtering: E-mails are processed using various machine learning algorithms for spam filtering. Targeted advertisements: In our mailbox, social sites, or search engines, we see advertisements of our liking. These are only some of the actual use cases that are implemented in the world today. One thing that is common between them is the user data. In the first example, retailers are using the history of transactions done by the user for targeted promotions and campaigns and for inventory management, among other things. Retail giants do this by providing users a loyalty or sign-up card. In the second example, the e-mail service provider uses trained machine learning algorithms to detect and flag spam. It does by going through the contents of e-mail/attachments and classifying the sender of the e-mail. In the third example, again the e-mail provider, social network, or search engine will go through our cookies, our profile, or our mails to do the targeted advertising. 
In all of these examples, it is mentioned in the terms and conditions of the agreement when we sign up with the retailer, e-mail provider, or social network that the user's data will be used but privacy will not be violated. It is really important that before using data that is not publicly available, we take the required permissions. Also, our machine learning models shouldn't do discrimination on the basis of region, race, and sex or of any other kind. The data provided should not be used for purposes not mentioned in the agreement or illegal in the region or country of existence. Machine learning – the process Machine learning algorithms are trained in keeping with the idea of how the human brain works. They are somewhat similar. Let's discuss the whole process. The machine learning process can be described in three steps: Input Abstraction Generalization These three steps are the core of how the machine learning algorithm works. Although the algorithm may or may not be divided or represented in such a way, this explains the overall approach. The first step concentrates on what data should be there and what shouldn't. On the basis of that, it gathers, stores, and cleans the data as per the requirements. The second step involves that the data be translated to represent the bigger class of data. This is required as we cannot capture everything and our algorithm should not be applicable for only the data that we have. The third step focuses on the creation of the model or an action that will use this abstracted data, which will be applicable for the broader mass. So, what should be the flow of approaching a machine learning problem? In this particular figure, we see that the data goes through the abstraction process before it can be used to create the machine learning algorithm. This process itself is cumbersome. The process follows the training of the model, which is fitting the model into the dataset that we have. The computer does not pick up the model on its own, but it is dependent on the learning task. The learning task also includes generalizing the knowledge gained on the data that we don't have yet. Therefore, training the model is on the data that we currently have and the learning task includes generalization of the model for future data. It depends on our model how it deduces knowledge from the dataset that we currently have. We need to make such a model that can gather insights into something that wasn't known to us before and how it is useful and can be linked to the future data. Different types of machine learning Machine learning is divided mainly into three categories: Supervised learning Unsupervised learning Reinforcement learning In supervised learning, the model/machine is presented with inputs and the outputs corresponding to those inputs. The machine learns from these inputs and applies this learning in further unseen data to generate outputs. Unsupervised learning doesn't have the required outputs; therefore it is up to the machine to learn and find patterns that were previously unseen. In reinforcement learning, the machine continuously interacts with the environment and learns through this process. This includes a feedback loop. Understanding decision trees Decision tree is a very good example of divide and conquer. It is one of the most practical and widely used methods for inductive inference. It is a supervised learning method that can be used for both classification and regression. 
It is non-parametric and its aim is to learn by inferring simple decision rules from the data and to build a model that can predict the value of the target variable. Before taking a decision, we weigh the pros and cons of the different options that we have. Let's say we want to purchase a phone and we have multiple choices in the price segment. Each of the phones has something really good, and maybe better than the others. To make a choice, we start by considering the most important feature that we want, and in this way we create a series of features that a phone has to pass to become the ultimate choice.

In this section, we will learn about:

Decision trees
Entropy measures
Random forests

We will also learn about famous decision tree learning algorithms such as ID3 and C5.0.

Decision tree learning algorithms

There are various decision tree learning algorithms that are actually variations of the core algorithm. The core algorithm is a top-down, greedy search through all possible trees. We are going to discuss two algorithms:

ID3
C4.5 and C5.0

The first algorithm, Iterative Dichotomiser 3 (ID3), was developed by Ross Quinlan in 1986. The algorithm proceeds by creating a multiway tree, where it uses greedy search to find, for each node, the feature that yields the maximum information gain for the categorical targets. As trees can grow to the maximum size, which can result in over-fitting of the data, pruning is used to obtain a more generalized model. C4.5 came after ID3 and eliminated the restriction that all features must be categorical. It does this by dynamically defining a discrete attribute based on the numerical variables; the continuous attribute values are partitioned into a discrete set of intervals. C4.5 creates sets of if-then rules from the trained trees of the ID3 algorithm. C5.0 is the latest version; it builds smaller rule sets and uses comparatively less memory.

An example

Let's apply what we've learned to create a decision tree using Julia. We will be using the example available for Python on scikit-learn.org and ScikitLearn.jl by Cedric St-Jean. We will first have to add the required packages:

julia> Pkg.update()
julia> Pkg.add("DecisionTree")
julia> Pkg.add("ScikitLearn")
julia> Pkg.add("PyPlot")

ScikitLearn.jl provides the Julia interface to the famous Python machine learning library, scikit-learn:

julia> using ScikitLearn
julia> using DecisionTree
julia> using PyPlot

After adding the required packages, we will create the dataset that we will be using in our example:

julia> # Create a random dataset
julia> srand(100)
julia> X = sort(5 * rand(80))
julia> XX = reshape(X, 80, 1)
julia> y = sin(X)
julia> y[1:5:end] += 3 * (0.5 - rand(16))

The last statement returns a 16-element Array{Float64,1}. Now we will create instances of two different models: in one model we will not limit the depth of the tree, and in the other model we will prune the decision tree on the basis of purity. We will then fit both models to our dataset (a sketch of this step is shown below). The first, unpruned decision tree ends up with 25 leaf nodes and a depth of 8; the second, pruned tree has six leaf nodes and a depth of 4. Now we will use the models to predict on the test dataset:

julia> # Predict
julia> X_test = 0:0.01:5.0
julia> y_1 = predict(regr_1, hcat(X_test))
julia> y_2 = predict(regr_2, hcat(X_test))

This creates a 501-element Array{Float64,1}.
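The original article shows the model-creation and fitting step only as screenshots, which are not reproduced in this excerpt. Assuming the example follows DecisionTree.jl's ScikitLearn-style API (suggested by the calls to predict above and the pruning_purity_threshold label in the plotting code below), that step might look roughly like the following sketch; treat the constructor names and the exact keyword argument as assumptions rather than the article's verbatim code:

julia> # Unpruned model: no limit on tree depth (illustrative sketch)
julia> regr_1 = DecisionTreeRegressor()

julia> # Pruned model: merge leaves below a purity threshold (assumed value)
julia> regr_2 = DecisionTreeRegressor(pruning_purity_threshold=0.05)

julia> # Fit both models to the toy dataset created above
julia> fit!(regr_1, XX, y)
julia> fit!(regr_2, XX, y)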
To better understand the results, let's plot both the models on the dataset that we have: julia> # Plot the results julia> scatter(X, y, c="k", label="data") julia> plot(X_test, y_1, c="g", label="no pruning", linewidth=2) julia> plot(X_test, y_2, c="r", label="pruning_purity_threshold=0.05", linewidth=2) julia> xlabel("data") julia> ylabel("target") julia> title("Decision Tree Regression") julia> legend(prop=Dict("size"=>10)) Decision trees can tend to overfit data. It is required to prune the decision tree to make it more generalized. But if we do more pruning than required, then it may lead to an incorrect model. So, it is required that we find the most optimized pruning level. It is quite evident that the first decision tree overfits to our dataset, whereas the second decision tree model is comparatively more generalized. Summary In this article, we learned about machine learning and its uses. Providing computers the ability to learn and improve has far-reaching uses in this world. It is used in predicting disease outbreaks, predicting weather, games, robots, self-driving cars, personal assistants, and lot more. There are three different types of machine learning: supervised learning, unsupervised learning, and reinforcement learning. We also learned about decision trees. Resources for Article: Further resources on this subject: Specialized Machine Learning Topics [article] Basics of Programming in Julia [article] More about Julia [article]

Parallel Computing

Packt
30 Sep 2016
9 min read
In this article written by Jalem Raj Rohit, author of the book Julia Cookbook, we cover the following recipes:

Basic concepts of parallel computing
Data movement
Parallel map and loop operations
Channels

(For more resources related to this topic, see here.)

Introduction

In this article, you will learn about performing parallel computing and using it to handle big data. Concepts like data movement, sharded arrays, and the map-reduce framework are important to know in order to handle large amounts of data by computing on it using parallelized CPUs. All the concepts discussed in this article will help you build good parallel computing and multiprocessing basics, including efficient data handling and code optimization.

Basic concepts of parallel computing

Parallel computing is a way of dealing with data in a parallel way. This can be done by connecting multiple computers as a cluster and using their CPUs for carrying out the computations. This style of computation is used when handling large amounts of data and also while running complex algorithms over significantly large data. The computations are executed faster due to the availability of multiple CPUs running them in parallel as well as the direct availability of RAM to each of them.

Getting ready

Julia has in-built support for parallel computing and multiprocessing, so these computations rarely require any external libraries.

How to do it…

Julia can be started on your local computer using multiple cores of your CPU, so we will have multiple workers for the process. This is how you can fire up Julia in multiprocessing mode in your terminal; it creates two worker processes on the machine, which means it uses two CPU cores:

julia -p 2

The output looks something like this; it might differ for different operating systems and different machines.

Now, we will look at the remotecall() function. It takes multiple arguments, the first one being the process which we want to assign the task to. The next argument is the function which we want to execute, and the subsequent arguments are the parameters of that function. In this example, we will create a 2 x 2 random matrix and assign it to process number 2. This can be done as follows:

task = remotecall(2, rand, 2, 2)

The preceding command gives the following output:

Now that the remotecall() function for remote referencing has been executed, we will fetch the result of the function through the fetch() function. This can be done as follows:

fetch(task)

The preceding command gives the following output:

Now, to perform some mathematical operations on the generated matrix, we can use the @spawnat macro, which takes in the mathematical operation and the fetch() function. The @spawnat macro actually wraps the expression 5 .+ fetch(task) into an anonymous function and runs it on the second worker. This can be done as follows:

task2 = @spawnat 2 5 .+ fetch(task)

There is also a function that eliminates the need for using two different functions, remotecall() and fetch(). The remotecall_fetch() function takes multiple arguments: the first one is the process that the task is being assigned to, the next argument is the function you want to execute, and the subsequent arguments are the arguments or parameters of that function. Now, we will use the remotecall_fetch() function to fetch an element of the task matrix for a particular index.
This can be done as follows: remotecall_fetch(2, getindex, task2, 1, 1) How it works… Julia can be started in the multiprocessing mode by specifying the number of processes needed while starting up the REPL. In this example, we started Julia as a two process mode. The maximum number of processes depends on the number of cores available in the CPU. The remotecall() function helps in selecting a particular process from the running processes in order to run a function or, in fact, any computation for us. The fetch() function is used to fetch the results of the remotecall() function from a common data resource (or the process) for all the running processes. The details of the data source would be covered in the later sections. The results of the fetch() function can also be used for further computations, which can be carried out with the @spawnat macro along with the results of fetch(). This would assign a process for the computation. The remotecall_fetch() function further eliminates the need for the fetch function in case of a direct execution. This has both the remotecall() and fetch() operations built into it. So, it acts as a combination of both the second and third points in this section. Data movement In parallel computing, data movements are quite common and are also a thing to be minimized due to the time and the network overhead due to the movements. In this recipe, we will see how that can be optimized to avoid latency as much as we can. Getting ready To get ready for this recipe, you need to have the Julia REPL started in the multiprocessing mode. This is explained in the Getting ready section of the preceding recipe. How to do it… Firstly, we will see how to do a matrix computation using the @spawn macro, which helps in data movement. So, we construct a matrix of shape 200 x 200 and then try to square it using the @spawn macro. This can be done as follows: mat = rand(200, 200) exec_mat = @spawn mat^2 fetch(exec_mat) The preceding command gives the following output: Now, we will look at an another way to achieve the same. This time, we will use the @spawn macro directly instead of the initialization step. We will discuss the advantages and drawbacks of each method in the How it works… section. So, this can be done as follows: mat = @spawn rand(200, 200)^2 fetch(mat) The preceding command gives the following output: How it works… In this example, we try to construct a 200X200 matrix and then used the @spawn macro to spawn a process in the CPU to execute the same for us. The @spawn macro spawns one of the two processes running, and it uses one of them for the computation. In the second example, you learned how to use the @spawn macro directly without an extra initialization part. The fetch() function helps us fetch the results from a common data resource of the processes. More on this will be covered in the following recipes. Parallel maps and loop operations In this recipe, you will learn a bit about the famous Map Reduce framework and why it is one of the most important ideas in the domains of big data and parallel computing. You will learn how to parallelize loops and use reducing functions on them through the several CPUs and machines and the concept of parallel computing, which you learned about in the previous recipes. Getting ready Just like the previous sections, Julia just needs to be running in the multiprocessing mode to follow along the following examples. This can be done through the instructions given in the first section. 
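The recipe that follows relies on a small helper function, count_heads, whose listing is not reproduced in this excerpt. Assuming it simply adds up n random Boolean draws, as the text describes, a minimal sketch saved as count_heads.jl might look like this:

# count_heads.jl -- minimal sketch of the helper used in the next recipe
# (assumed implementation: it just sums n random Boolean draws)
function count_heads(n)
    c::Int = 0
    for i = 1:n
        c += rand(Bool)   # rand(Bool) counts as 0 or 1 when added to an Int
    end
    c
end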
How to do it…

Firstly, we will write a function that adds up n random bits. Writing this function has nothing to do with multiprocessing; it uses simple Julia functions and loops, as in the sketch shown above. Now, we will use the @spawn macro, which we learned about previously, to run the count_heads() function as separate processes. The count_heads() function needs to be in the same directory for this to work. This can be done as follows:

require("count_heads")

a = @spawn count_heads(100)
b = @spawn count_heads(100)
fetch(a) + fetch(b)

However, we can use the concept of multiprocessing and parallelize the loop directly, as well as take the sum. The parallelizing part is called mapping, and the addition of the parallelized bits is called reduction. Together, the process constitutes the famous Map-Reduce framework. This can be done using the @parallel macro, as follows:

nheads = @parallel (+) for i = 1:200
    Int(rand(Bool))
end

How it works…

The first function is a simple Julia function that adds random bits with every loop iteration. It was created just for the demonstration of Map-Reduce operations. In the second point, we spawn two separate processes for executing the function and then fetch the results of both of them and add them up. However, that is not really a neat way to carry out parallel computation of functions and loops. Instead, the @parallel macro provides a better way to do it: it allows the user to parallelize the loop and then reduce the computations through an operator, which together constitute the Map-Reduce operation.

Channels

Channels are like the background plumbing for parallel computing in Julia. They are the reservoirs from which the individual processes access their data.

Getting ready

The requirements are similar to the previous sections. This is mostly a theoretical section, so you just need to run the experiments on your own. For that, you need to run your Julia REPL in multiprocessing mode.

How to do it…

Channels are shared queues with a fixed length. They are common data reservoirs for the processes which are running, and multiple readers or workers can access them. The workers can read the data through the fetch() function, which we already discussed in the previous sections, and they can also write to the channel through the put!() function. This means that the workers can add more data to the resource, which can be accessed by all the workers running a particular computation. Closing a channel after usage is good practice to avoid data corruption and unnecessary memory usage. It can be done using the close() function.

Summary

In this article we covered the basic concepts of parallel computing and the data movement that takes place in the network. We also learned about parallel maps and loop operations along with the famous Map-Reduce framework. At the end, we got a brief understanding of channels and how individual processes access their data from channels.

Resources for Article:

Further resources on this subject:

More about Julia [article]
Basics of Programming in Julia [article]
Simplifying Parallelism Complexity in C# [article]

Deep Learning with Torch

Preetham Sreenivas
29 Sep 2016
10 min read
Torch is a scientific computing framework built on top of Lua[JIT]. The nn package and the ecosystem around it provide a very powerful framework for building deep learning models, striking a perfect balance between speed and flexibility. It is used at Facebook AI Research (FAIR), Twitter Cortex, DeepMind, Yann LeCun's group at NYU, Fei-Fei Li's at Stanford, and many more industrial and academic labs. If you are like me, and don't like writing equations for backpropagation every time you want to try a simple model, Torch is a great solution. With Torch, you can also do pretty much anything you can imagine, whether that is writing custom loss functions, dreaming up an arbitrary acyclic graph network, using multiple GPUs, or loading pre-trained imagenet models from the caffe model-zoo (yes, you can load models trained in caffe with a single line). Without further ado, let's jump right into the awesome world of deep learning.

Prerequisites

Some knowledge of deep learning: A Primer, Bengio's deep learning book, Hinton's Coursera course.
A bit of Lua. Its syntax is very C-like and can be picked up fairly quickly if you know Python or JavaScript: Learn Lua in 15 minutes, Torch For Numpy Users.
A machine with Torch installed, since this is intended to be hands-on.

On Ubuntu 12+ and Mac OS X, installing Torch looks like this:

# in a terminal, run the commands WITHOUT sudo
$ git clone https://github.com/torch/distro.git ~/torch --recursive
$ cd ~/torch; bash install-deps;
$ ./install.sh

# On Linux with bash
$ source ~/.bashrc
# On OSX or in Linux with no bash.
$ source ~/.profile

Once you've installed Torch, you can run a Torch script using:

$ th script.lua

# alternatively you can fire up a terminal torch interpreter using th -i
$ th -i
# and run multiple scripts one by one, the variables will be accessible to other scripts
> dofile 'script1.lua'
> dofile 'script2.lua'
> print(variable) -- variable from either of these scripts.

The sections below are very code intensive, but you can run these commands from Torch's terminal interpreter:

$ th -i

Building a Model: The Basics

A module is the basic building block of any Torch model. It has forward and backward methods for the forward and backward passes of backpropagation. You can combine modules using containers, and of course, calling forward and backward on containers propagates inputs and gradients correctly.

-- A simple mlp model with sigmoids
require 'nn'
linear1 = nn.Linear(100,10) -- A linear layer Module
linear2 = nn.Linear(10,2)

-- You can combine modules using containers, sequential is the most used one
model = nn.Sequential() -- A container
model:add(linear1)
model:add(nn.Sigmoid())
model:add(linear2)
model:add(nn.Sigmoid())

-- the forward step
input = torch.rand(100)
target = torch.rand(2)
output = model:forward(input)

Now we need a criterion to measure how well our model is performing, in other words, a loss function. nn.Criterion is the abstract class that all loss functions inherit from. It provides forward and backward methods, computing the loss and the gradients respectively. Torch provides most of the commonly used criterions out of the box. It isn't much of an effort to write your own either.
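As an illustration of that last point (and not part of the original article), here is a minimal sketch of what a hand-rolled criterion could look like; it re-implements a mean squared error loss by subclassing nn.Criterion. The class name is made up, and in practice you would simply use the built-in nn.MSECriterion shown next.

-- Minimal sketch of a custom criterion (illustrative only):
-- a hand-rolled mean squared error built on top of nn.Criterion.
require 'nn'

local MyMSE, parent = torch.class('nn.MyMSECriterion', 'nn.Criterion')

function MyMSE:__init()
   parent.__init(self)
end

-- forward pass: compute the loss
function MyMSE:updateOutput(input, target)
   self.output = (input - target):pow(2):mean()
   return self.output
end

-- backward pass: gradient of the loss w.r.t. the input
function MyMSE:updateGradInput(input, target)
   self.gradInput = (input - target):mul(2 / input:nElement())
   return self.gradInput
end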
criterion = nn.MSECriterion() -- mean squared error criterion
loss = criterion:forward(output, target)
gradientsAtOutput = criterion:backward(output, target)

-- To perform the backprop step, we need to pass these gradients to the backward
-- method of the model
gradAtInput = model:backward(input, gradientsAtOutput)

lr = 0.1 -- learning rate for our model
model:updateParameters(lr) -- updates the parameters using the lr parameter

The updateParameters method simply subtracts the gradients, scaled by the learning rate, from the model parameters. This is vanilla stochastic gradient descent. Typically, the updates we do are more complex. For example, if we want to use momentum, we need to keep track of the updates we did in the previous epoch. There are a lot more fancy optimization schemes, such as RMSProp, adam, adagrad, and L-BFGS, that do more complex things like adapting the learning rate, momentum factor, and so on. The optim package provides these optimization routines out of the box.

Dataset

We'll use the German Traffic Sign Recognition Benchmark (GTSRB) dataset. This dataset has 43 classes of traffic signs of varying sizes, illuminations, and occlusions. There are 39,000 training images and 12,000 test images. Traffic signs in each of the images are not centered and they have a 10% border around them. I have included a shell script for downloading the data along with the code for this tutorial in this github repo.[1]

git clone https://github.com/preethamsp/tutorial.gtsrb.torch.git
cd tutorial.gtsrb.torch/datasets
bash download_gtsrb.sh

[1] The code in the repo is much more polished than the snippets in the tutorial. It is modular and allows you to change the model and/or datasets easily.

Model

Let's build a downsized vgg style model with what we've learned.

function createModel()
   require 'nn'
   nbClasses = 43
   local net = nn.Sequential()

   --[[building block: adds a convolution layer, batch norm layer
       and a relu activation to the net]]--
   function ConvBNReLU(nInputPlane, nOutputPlane)
      -- kernel size = (3,3), stride = (1,1), padding = (1,1)
      net:add(nn.SpatialConvolution(nInputPlane, nOutputPlane, 3,3, 1,1, 1,1))
      net:add(nn.SpatialBatchNormalization(nOutputPlane,1e-3))
      net:add(nn.ReLU(true))
   end

   ConvBNReLU(3,32)
   ConvBNReLU(32,32)
   net:add(nn.SpatialMaxPooling(2,2,2,2))
   net:add(nn.Dropout(0.2))

   ConvBNReLU(32,64)
   ConvBNReLU(64,64)
   net:add(nn.SpatialMaxPooling(2,2,2,2))
   net:add(nn.Dropout(0.2))

   ConvBNReLU(64,128)
   ConvBNReLU(128,128)
   net:add(nn.SpatialMaxPooling(2,2,2,2))
   net:add(nn.Dropout(0.2))

   net:add(nn.View(128*6*6))
   net:add(nn.Dropout(0.5))
   net:add(nn.Linear(128*6*6,512))
   net:add(nn.BatchNormalization(512))
   net:add(nn.ReLU(true))
   net:add(nn.Linear(512,nbClasses))
   net:add(nn.LogSoftMax())

   return net
end

The first layer contains three input channels because we're going to pass RGB images (three channels). For grayscale images, the first layer has one input channel. I encourage you to play around and modify the network.[2] There are a bunch of new modules that need some elaboration. The Dropout module randomly deactivates a neuron with some probability. It is known to help generalization by preventing co-adaptation between neurons; that is, a neuron should now depend less on its peers, forcing it to learn a bit more. BatchNormalization is a very recent development. It is known to speed up convergence by normalizing the outputs of a layer to unit gaussian using the statistics of a batch. Let's use this model and train it. In the interest of brevity, I'll use these constructs directly.
The code describing these constructs is in datasets/gtsrb.lua. DataGen:trainGenerator(batchSize) DataGen:valGenerator(batchSize) These provide iterators over batches of train and test data respectively. You'll find that the model code (models/vgg_small.lua) in the repo is different. It is designed to allow you to experiment quickly. Using optim to train the model Using a stochastic gradient descent (sgd) from the optim package to minimize a function f looks like this: optim.sgd(feval, params, optimState) Where: feval: A user-defined function that respects the API: f, df/params = feval(params) params: The current parameter vector (a 1D torch.Tensor) optimState: A table of parameters, and state variables, dependent upon the algorithm Since we are optimizing the loss of the neural network, parameters should be the weights and other parameters of the network. We get these as a flattened 1D tensor using model:getParameters. It also returns a tensor containing the gradients of these parameters. This is useful in creating the feval function above. model = createModel() criterion = nn.ClassNLLCriterion() -- criterion we are optimizing: negative log loss params, gradParams = model:getParameters() local function feval() -- criterion.output stores the latest output of criterion return criterion.output, gradParams end We need to create an optimState table and initialize it with a configuration of our optimizer like learning rate and momentum: optimState = { learningRate = 0.01, momentum = 0.9, dampening = 0.0, nesterov = true, } Now, an update to the model should do the following: Compute the output of the model using model:forward(). Compute the loss and the gradients at output layer using criterion:forward() and criterion:backward() respectively. Update the gradients of the model parameters using model:backward(). Update the model using optim.sgd. -- Forward pass output = model:forward(input) loss = criterion:forward(output, target) -- Backward pass critGrad = criterion:backward(output, target) model:backward(input, critGrad) -- Updates optim.sgd(feval, params, optimState) Note: The order above should be respected, as backward assumes forward was run just before it. Changing this order might result in gradients not being computed correctly. Putting it all together Let's put it all together and write a function that trains the model for an epoch. We'll create a loop that iterates over the train data in batches and updates the model. 
model = createModel() criterion = nn.ClassNLLCriterion() dataGen = DataGen('datasets/GTSRB/') -- Data generator params, gradParams = model:getParameters() batchSize = 32 optimState = { learningRate = 0.01, momentum = 0.9, dampening = 0.0, nesterov = true, } function train() -- Dropout and BN behave differently during training and testing -- So, switch to training mode model:training() local function feval() return criterion.output, gradParams end for input, target in dataGen:trainGenerator(batchSize) do -- Forward pass local output = model:forward(input) local loss = criterion:forward(output, target) -- Backward pass model:zeroGradParameters() -- clear grads from previous update local critGrad = criterion:backward(output, target) model:backward(input, critGrad) -- Updates optim.sgd(feval, params, optimState) end end The test function is extremely similar, except that we don't need to update the parameters: confusion = optim.ConfusionMatrix(nbClasses) -- to calculate accuracies function test() model:evaluate() -- switch to evaluate mode confusion:zero() -- clear confusion matrix for input, target in dataGen:valGenerator(batchSize) do local output = model:forward(input) confusion:batchAdd(output, target) end confusion:updateValids() local test_acc = confusion.totalValid * 100 print(('Test accuracy: %.2f'):format(test_acc)) end Now that everything is set, you can train your network and print the test accuracies: max_epoch = 20 for i = 1,20 do train() test() end An epoch takes around 30 seconds on a TitanX and gives about 97.7% accuracy after 20 epochs. This is a very basic model and honestly I haven't tried optimizing the parameters much. There are a lot of things that can be done to crank up the accuracies. Try different processing procedures. Experiment with the net structure. Different weight initializations, and learning rate schedules. An Ensemble of different models; for example, train multiple models and take a majority vote. You can have a look at the state of the art on this dataset here. They achieve upwards of 99.5% accuracy using a clever method to boost the geometric variation of CNNs. Conclusion We looked at how to build a basic mlp in Torch. We then moved on to building a Convolutional Neural Network and trained it to solve a real-world problem of traffic sign recognition. For a beginner, Torch/LUA might not be as easy. But once you get a hang of it, you have access to a deep learning framework which is very flexible yet fast. You will be able to easily reproduce latest research or try new stuff unlike in rigid frameworks like keras or nolearn. I encourage you to give it a fair try if you are going anywhere near deep learning. Resources Torch Cheat Sheet Awesome Torch Torch Blog Facebook's Resnet Code Oxford's ML Course Practicals Learn torch from Github repos About the author Preetham Sreenivas is a data scientist at Fractal Analytics. Prior to that, he was a software engineer at Directi.