

Getting Started with Python Packages

Packt
02 Nov 2016
37 min read
In this article by Luca Massaron and Alberto Boschetti, the authors of the book Python Data Science Essentials - Second Edition, we will cover the steps for installing Python, look at the different installation packages, and have a glance at the essential packages that will constitute a complete data science toolbox.

Whether you are an eager learner of data science or a well-grounded data science practitioner, you can take advantage of this essential introduction to Python for data science. You can use it to the fullest if you already have at least some previous experience in basic coding, in writing general-purpose computer programs in Python, or in some other data-analysis-specific language such as MATLAB or R.

Introducing data science and Python

Data science is a relatively new knowledge domain, though its core components have been studied and researched for many years by the computer science community. Its components include linear algebra, statistical modelling, visualization, computational linguistics, graph analysis, machine learning, business intelligence, and data storage and retrieval.

Data science is a new domain and you have to take into consideration that currently its frontiers are still somewhat blurred and dynamic. Since data science is made up of various constituent sets of disciplines, please also keep in mind that there are different profiles of data scientists, depending on their competencies and areas of expertise.

In such a situation, what can be the best tool of the trade that you can learn and effectively use in your career as a data scientist? We believe that the best tool is Python, and we intend to provide you with all the essential information that you will need for a quick start.

In addition, other tools such as R and MATLAB provide data scientists with specialized tools to solve specific problems in statistical analysis and matrix manipulation in data science. However, only Python really completes your data scientist skill set. This multipurpose language is suitable for both development and production alike; it can handle small- to large-scale data problems and it is easy to learn and grasp no matter what your background or experience is.

Created in 1991 as a general-purpose, interpreted, and object-oriented language, Python has slowly and steadily conquered the scientific community and grown into a mature ecosystem of specialized packages for data processing and analysis. It allows you to have uncountable and fast experimentations, easy theory development, and prompt deployment of scientific applications.

At present, the core Python characteristics that render it an indispensable data science tool are as follows:

- It offers a large, mature system of packages for data analysis and machine learning. It guarantees that you will get all that you may need in the course of a data analysis, and sometimes even more.
- Python can easily integrate different tools and offers a truly unifying ground for different languages, data strategies, and learning algorithms that can be fitted together easily and which can concretely help data scientists forge powerful solutions. There are packages that allow you to call code in other languages (in Java, C, FORTRAN, R, or Julia), outsourcing some of the computations to them and improving your script performance.
- It is very versatile. No matter what your programming background or style is (object-oriented, procedural, or even functional), you will enjoy programming with Python.
- It is cross-platform; your solutions will work perfectly and smoothly on Windows, Linux, and Mac OS systems. You won't have to worry all that much about portability.
- Although interpreted, it is undoubtedly fast compared to other mainstream data analysis languages such as R and MATLAB (though it is not comparable to C, Java, and the newly emerged Julia language). Moreover, there are also static compilers such as Cython or just-in-time compilers such as PyPy that can transform Python code into C for higher performance.
- It can work with large in-memory data because of its minimal memory footprint and excellent memory management. The memory garbage collector will often save the day when you load, transform, dice, slice, save, or discard data using various iterations and reiterations of data wrangling.
- It is very simple to learn and use. After you grasp the basics, there's no better way to learn more than by immediately starting with the coding.

Moreover, the number of data scientists using Python is continuously growing: new packages and improvements are released by the community every day, making the Python ecosystem an increasingly prolific and rich language for data science.

Installing Python

First, let's introduce all the settings you need in order to create a fully working data science environment to test the examples and experiment with the code that we are going to provide you with.

Python is an open source, object-oriented, and cross-platform programming language. Compared to some of its direct competitors (for instance, C++ or Java), Python is very concise. It allows you to build a working software prototype in a very short time. Yet it has become the most used language in the data scientist's toolbox not just because of that. It is also a general-purpose language, and it is very flexible due to a variety of available packages that solve a wide spectrum of problems and necessities.

Python 2 or Python 3?

There are two main branches of Python: 2.7.x and 3.x. At the time of writing this article, the Python Foundation (www.python.org) is offering downloads for Python versions 2.7.11 and 3.5.1. Although the third version is the newest, the older one is still the most used version in the scientific area, since a few packages (check the website py3readiness.org for a compatibility overview) won't run on it otherwise yet. In addition, there is no immediate backward compatibility between Python 3 and 2. In fact, if you try to run some code developed for Python 2 with a Python 3 interpreter, it may not work. Major changes have been made to the newest version, and that has affected past compatibility. Some data scientists, having built most of their work on Python 2 and its packages, are reluctant to switch to the new version.

We intend to address a larger audience of data scientists, data analysts, and developers, who may not have such a strong legacy with Python 2. Thus, we agreed that it would be better to work with Python 3 rather than the older version. We suggest using a version such as Python 3.4 or above. After all, Python 3 is the present and the future of Python. It is the only version that will be further developed and improved by the Python Foundation, and it will be the default version of the future on many operating systems. Anyway, if you are currently working with version 2 and you prefer to keep on working with it, you can still run the examples.
In fact, for the most part, our code will simply work on Python 2 after the code itself is preceded by these imports:

from __future__ import (absolute_import, division, print_function, unicode_literals)
from builtins import *
from future import standard_library
standard_library.install_aliases()

The from __future__ import commands should always occur at the beginning of your scripts, or else you may experience Python reporting an error. As described on the Python-future website (python-future.org), these imports will help convert several Python 3-only constructs to a form compatible with both Python 3 and Python 2 (and in any case, most Python 3 code should just simply work on Python 2 even without the aforementioned imports).

In order to run the preceding imports successfully, if the future package is not already available on your system, you should install it (version >= 0.15.2) using the following command, to be executed from a shell:

$> pip install -U future

If you're interested in understanding the differences between Python 2 and Python 3 further, we recommend reading the wiki page offered by the Python Foundation itself: wiki.python.org/moin/Python2orPython3.

Step-by-step installation

Novice data scientists who have never used Python (and who likely don't have the language readily installed on their machines) need to first download the installer from the main website of the project, www.python.org/downloads/, and then install it on their local machine.

We will now cover the steps that will provide you with full control over what can be installed on your machine. This is very useful when you have to set up single machines to deal with different tasks in data science. Anyway, please be warned that a step-by-step installation really takes time and effort. Instead, installing a ready-made scientific distribution will lessen the burden of the installation procedures and may be well suited for first starting and learning, because it saves you time and sometimes even trouble, though it will put a large number of packages (most of which we won't use) on your computer all at once.

Python being a multiplatform programming language, you'll find installers for machines that either run on Windows or on Unix-like operating systems. Please remember that some of the latest versions of most Linux distributions (such as CentOS, Fedora, Red Hat Enterprise, and Ubuntu) have Python 2 packaged in the repository. In such a case, and in the case that you already have a Python version on your computer (since our examples run on Python 3), you first have to check what version you are running. To do such a check, just follow these instructions:

Open a Python shell: type python in the terminal, or click on any Python icon you find on your system. Then, after Python has started, test the installation by running the following code in the Python interactive shell or REPL:

>>> import sys
>>> print (sys.version_info)

If you can read that your Python version has the major=2 attribute, it means that you are running a Python 2 instance. Otherwise, if the attribute is valued 3, or if the print statement reports back to you something like v3.x.x (for instance v3.5.1), you are running the right version of Python and you are ready to move forward.

To clarify the operations we have just mentioned, when a command is given in the terminal command line, we prefix the command with $>. Otherwise, if it's for the Python REPL, it's preceded by >>>.
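As a quick illustration of the check just described, the following is a minimal sketch of ours (not from the book) that turns the manual inspection of sys.version_info into a script that fails early on Python 2:

import sys

# Print the full version tuple, then stop if we are not on Python 3.
print(sys.version_info)
if sys.version_info.major < 3:
    raise RuntimeError("These examples assume Python 3.4 or above; "
                       "found %d.%d" % sys.version_info[:2])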
The installation of packages

Python won't come bundled with everything you need, unless you take a specific premade distribution. Therefore, to install the packages you need, you can use either pip or easy_install. Both of these tools run in the command line and make the process of installation, upgrade, and removal of Python packages a breeze. To check which tools have been installed on your local machine, run the following command:

$> pip

To install pip, follow the instructions given at pip.pypa.io/en/latest/installing.html. Alternatively, you can also run this command:

$> easy_install

If both of these commands end with an error, you need to install one of them. We recommend that you use pip because it is thought of as an improvement over easy_install. Moreover, easy_install is going to be dropped in the future, and pip has important advantages over it. It is preferable to install everything using pip because:

- It is the preferred package manager for Python 3. Starting with Python 2.7.9 and Python 3.4, it is included by default with the Python binary installers.
- It provides an uninstall functionality.
- It rolls back and leaves your system clean if, for whatever reason, the package installation fails.

Using easy_install in spite of pip's advantages makes sense if you are working on Windows, because pip won't always install pre-compiled binary packages. Sometimes it will try to build the package's extensions directly from C source, thus requiring a properly configured compiler (and that's not an easy task on Windows). This depends on whether the package is distributed as eggs (pip cannot directly use their binaries, so it needs to build from their source code) or as wheels (in this case, pip can install binaries if available, as explained here: pythonwheels.com/). Instead, easy_install will always install available binaries from eggs and wheels. Therefore, if you are experiencing unexpected difficulties installing a package, easy_install can save your day (at some price anyway, as we just mentioned in the list).

The most recent versions of Python should already have pip installed by default. Therefore, you may have it already installed on your system. If not, the safest way is to download the get-pip.py script from bootstrap.pypa.io/get-pip.py and then run it using the following:

$> python get-pip.py

The script will also install setuptools from pypi.python.org/pypi/setuptools, which also contains easy_install.

You're now ready to install the packages you need in order to run the examples provided in this article. To install the <package-name> generic package, you just need to run this command:

$> pip install <package-name>

Alternatively, you can run the following command:

$> easy_install <package-name>

Note that in some systems, pip might be named pip3 and easy_install might be named easy_install-3, to stress the fact that both operate on packages for Python 3. If you're unsure, check the version of Python that pip is operating on with:

$> pip -V

For easy_install, the command is slightly different:

$> easy_install --version

After this, the <package-name> package and all its dependencies will be downloaded and installed. If you're not certain whether a library has been installed or not, just try to import a module inside it. If the Python interpreter raises an ImportError, it can be concluded that the package has not been installed.
This is what happens when the NumPy library has been installed:

>>> import numpy

This is what happens if it's not installed:

>>> import numpy
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
ImportError: No module named numpy

In the latter case, you'll need to first install it through pip or easy_install.

Take care that you don't confuse packages with modules. With pip, you install a package; in Python, you import a module. Sometimes, the package and the module have the same name, but in many cases, they don't match. For example, the sklearn module is included in the package named Scikit-learn.

Finally, to search and browse the Python packages available for Python, look at pypi.python.org.

Package upgrades

More often than not, you will find yourself in a situation where you have to upgrade a package because either the new version is required by a dependency or it has additional features that you would like to use. First, check the version of the library you have installed by glancing at the __version__ attribute, as shown in the following example with numpy:

>>> import numpy
>>> numpy.__version__ # 2 underscores before and after
'1.9.2'

Now, if you want to update it to a newer release, say the 1.11.0 version, you can run the following command from the command line:

$> pip install -U numpy==1.11.0

Alternatively, you can use the following command:

$> easy_install --upgrade numpy==1.11.0

Finally, if you're interested in upgrading it to the latest available version, simply run this command:

$> pip install -U numpy

You can alternatively run the following command:

$> easy_install --upgrade numpy

Scientific distributions

As you've read so far, creating a working environment is a time-consuming operation for a data scientist. You first need to install Python and then, one by one, you can install all the libraries that you will need (sometimes, the installation procedures may not go as smoothly as you'd hoped). If you want to save time and effort and want to ensure that you have a fully working Python environment that is ready to use, you can just download, install, and use a scientific Python distribution. Apart from Python, these distributions also include a variety of preinstalled packages, and sometimes they even have additional tools and an IDE. A few of them are very well known among data scientists, and in the following content, you will find some of the key features of each of them. We suggest that you promptly download and install a scientific distribution, such as Anaconda (which is the most complete one).

Anaconda (continuum.io/downloads) is a Python distribution offered by Continuum Analytics that includes nearly 200 packages, which comprise NumPy, SciPy, pandas, Jupyter, Matplotlib, Scikit-learn, and NLTK. It's a cross-platform distribution (Windows, Linux, and Mac OS X) that can be installed on machines with other existing Python distributions and versions. Its base version is free; add-ons that contain advanced features are charged separately. Anaconda introduces conda, a binary package manager, as a command-line tool to manage your package installations. As stated on the website, Anaconda's goal is to provide an enterprise-ready Python distribution for large-scale processing, predictive analytics, and scientific computing.
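Before turning to conda, it can help to combine the import test and the __version__ check described above into a single report. The following is a small sketch of ours (not from the book); the package list is just an example:

import importlib

# Try to import each package and report its version, or flag it as missing.
for name in ("numpy", "scipy", "pandas", "sklearn", "matplotlib"):
    try:
        module = importlib.import_module(name)
        print(name, getattr(module, "__version__", "unknown version"))
    except ImportError:
        print(name, "is not installed")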
Leveraging conda to install packages

If you've decided to install an Anaconda distribution, you can take advantage of the conda binary installer we mentioned previously. Anyway, conda is an open source package management system, and consequently it can be installed separately from an Anaconda distribution.

You can test immediately whether conda is available on your system. Open a shell and type:

$> conda -V

If conda is available, the version of your conda will appear; otherwise, an error will be reported. If conda is not available, you can quickly install it on your system by going to conda.pydata.org/miniconda.html and installing the Miniconda software suitable for your computer. Miniconda is a minimal installation that only includes conda and its dependencies.

conda can help you manage two tasks: installing packages and creating virtual environments. In this section, we will explore how conda can help you easily install most of the packages you may need in your data science projects.

Before starting, please check that you have the latest version of conda at hand:

$> conda update conda

Now you can install any package you need. To install the <package-name> generic package, you just need to run the following command:

$> conda install <package-name>

You can also install a particular version of the package just by pointing it out:

$> conda install <package-name>=1.11.0

Similarly, you can install multiple packages at once by listing all their names:

$> conda install <package-name-1> <package-name-2>

If you just need to update a package that you previously installed, you can keep on using conda:

$> conda update <package-name>

You can update all the available packages simply by using the --all argument:

$> conda update --all

Finally, conda can also uninstall packages for you:

$> conda remove <package-name>

If you would like to know more about conda, you can read its documentation at conda.pydata.org/docs/index.html. In summary, as its main advantage, it handles binaries even better than easy_install (by always providing a successful installation on Windows without any need to compile the packages from source), but without easy_install's problems and limitations. With the use of conda, packages are easy to install (and installation is always successful), update, and even uninstall. On the other hand, conda cannot install directly from a git server (so it cannot access the latest version of many packages under development) and it doesn't cover all the packages available on PyPI, as pip itself does.

Enthought Canopy

Enthought Canopy (enthought.com/products/canopy) is a Python distribution by Enthought Inc. It includes more than 200 preinstalled packages, such as NumPy, SciPy, Matplotlib, Jupyter, and pandas. This distribution is targeted at engineers, data scientists, quantitative and data analysts, and enterprises. Its base version is free (it is named Canopy Express), but if you need advanced features, you have to buy a full version. It's a multiplatform distribution and its command-line install tool is canopy_cli.

PythonXY

PythonXY (python-xy.github.io) is a free, open source Python distribution maintained by the community. It includes a number of packages, including NumPy, SciPy, NetworkX, Jupyter, and Scikit-learn. It also includes Spyder, an interactive development environment inspired by the MATLAB IDE. The distribution is free. It works only on Microsoft Windows, and its command-line installation tool is pip.

WinPython

WinPython (winpython.sourceforge.net) is also a free, open source Python distribution maintained by the community. It is designed for scientists, and includes many packages such as NumPy, SciPy, Matplotlib, and Jupyter.
It also includes Spyder as an IDE. It is free and portable. You can put WinPython into any directory, or even onto a USB flash drive, and at the same time maintain multiple copies and versions of it on your system. It works only on Microsoft Windows, and its command-line tool is the WinPython Package Manager (WPPM).

Explaining virtual environments

No matter whether you have chosen to install a stand-alone Python or a scientific distribution, you may have noticed that you are actually bound on your system to the Python version you have installed. The only exception, for Windows users, is to use a WinPython distribution, since it is a portable installation and you can have as many different installations as you need.

A simple solution to break free of such a limitation is to use virtualenv, a tool to create isolated Python environments. That means that, by using different Python environments, you can easily achieve these things:

- Testing any new package installation or doing experimentation on your Python environment without any fear of breaking anything in an irreparable way. In this case, you need a version of Python that acts as a sandbox.
- Having at hand multiple Python versions (both Python 2 and Python 3), geared with different versions of installed packages. This can help you in dealing with different versions of Python for different purposes (for instance, some of the packages we are going to present only work on Windows OS using Python 3.4, which is not the latest release).
- Taking a replicable snapshot of your Python environment easily and having your data science prototypes work smoothly on any other computer or in production. In this case, your main concern is the immutability and replicability of your working environment.

You can find documentation about virtualenv at virtualenv.readthedocs.io/en/stable, though we are going to provide you with all the directions you need to start using it immediately. In order to take advantage of virtualenv, you first have to install it on your system:

$> pip install virtualenv

After the installation completes, you can start building your virtual environments. Before proceeding, you have to take a few decisions:

- If you have more versions of Python installed on your system, you have to decide which version to pick. Otherwise, virtualenv will take the Python version virtualenv was installed with on your system. In order to set a different Python version, you have to type the argument -p followed by the version of Python you want, or insert the path of the Python executable to be used (for instance, -p python2.7, or just point to a Python executable such as -p c:\Anaconda2\python.exe).
- With virtualenv, when required to install a certain package, it will install it from scratch, even if it is already available at a system level (in the Python directory from which you created the virtual environment). This default behavior makes sense because it allows you to create a completely separated, empty environment. In order to save disk space and limit the installation time of all the packages, you may instead decide to take advantage of already available packages on your system by using the argument --system-site-packages.
- You may want to be able to later move your virtual environment across Python installations, even among different machines. Therefore, you may want to make the functioning of all of the environment's scripts relative to the path it is placed in by using the argument --relocatable.
After deciding on the Python version, the linking to existing global packages, and the relocatability of the virtual environment, in order to start, you just launch the command from a shell, declaring the name you would like to assign to your new environment:

$> virtualenv clone

virtualenv will just create a new directory using the name you provided, in the path from which you actually launched the command. To start using it, you just enter the directory and type activate:

$> cd clone
$> activate

At this point, you can start working in your separated Python environment, installing packages and working with code. If you need to install multiple packages at once, you may need a special function from pip, pip freeze, which will list all the packages (and their versions) that you have installed on your system. You can record the entire list in a text file with this command:

$> pip freeze > requirements.txt

After saving the list in a text file, just take it into your virtual environment and install all the packages in a breeze with a single command:

$> pip install -r requirements.txt

Each package will be installed according to the order in the list (packages are listed in a case-insensitive sorted order). If a package requires other packages that are later in the list, that's not a big deal, because pip automatically manages such situations. So if your package requires NumPy and NumPy is not yet installed, pip will install it first.

When you're finished installing packages and using your environment for scripting and experimenting, in order to return to your system defaults, just issue this command:

$> deactivate

If you want to remove the virtual environment completely, after deactivating and getting out of the environment's directory, you just have to get rid of the environment's directory itself by a recursive deletion. For instance, on Windows you just do this:

$> rd /s /q clone

On Linux and Mac, the command will be:

$> rm -r -f clone

If you are working extensively with virtual environments, you should consider using virtualenvwrapper, a set of wrappers for virtualenv that helps you manage multiple virtual environments easily. It can be found at bitbucket.org/dhellmann/virtualenvwrapper. If you are operating on a Unix system (Linux or OS X), another solution we have to mention is pyenv (which can be found at https://github.com/yyuu/pyenv). It lets you set your main Python version, allows the installation of multiple versions, and creates virtual environments. Its peculiarity is that it does not depend on Python being installed and it works perfectly at the user level (no need for sudo commands).

conda for managing environments

If you have installed the Anaconda distribution, or you have tried conda using a Miniconda installation, you can also take advantage of the conda command to run virtual environments as an alternative to virtualenv. Let's see in practice how to use conda for that. We can check what environments we have available like this:

$> conda info -e

This command will report what environments you can use on your system based on conda. Most likely, your only environment will be just "root", pointing to your Anaconda distribution's folder.

As an example, we can create an environment based on Python version 3.4, having all the necessary Anaconda-packaged libraries installed. That makes sense, for instance, for using the package Theano together with Python 3 on Windows (because of an issue we will explain in a few paragraphs).
In order to create such an environment, just do:

$> conda create -n python34 python=3.4 anaconda

The command asks for a particular Python version (3.4) and requires the installation of all the packages available in the Anaconda distribution (the argument anaconda). It names the environment python34 using the argument -n. The complete installation should take a while, given the large number of packages in the Anaconda installation. After the installation has completed, you can activate the environment:

$> activate python34

If you need to install additional packages in your environment, when it is activated, you just do:

$> conda install -n python34 <package-name1> <package-name2>

That is, you make the list of the required packages follow the name of your environment. Naturally, you can also use pip install, as you would do in a virtualenv environment. You can also use a file instead of listing all the packages by name yourself. You can create a list in an environment using the list argument and piping the output to a file:

$> conda list -e > requirements.txt

Then, in your target environment, you can install the entire list using:

$> conda install --file requirements.txt

You can even create an environment based on a requirements list:

$> conda create -n python34 python=3.4 --file requirements.txt

Finally, after having used the environment, to close the session, you simply do this:

$> deactivate

Contrary to virtualenv, there is a specialized argument in order to completely remove an environment from your system:

$> conda remove -n python34 --all

A glance at the essential packages

We mentioned that the two most relevant characteristics of Python are its ability to integrate with other languages and its mature package system, which is well embodied by PyPI (the Python Package Index: pypi.python.org/pypi), a common repository for the majority of Python open source packages that is constantly maintained and updated. The packages that we are now going to introduce are strongly analytical, and they will constitute a complete data science toolbox. All the packages are made up of extensively tested and highly optimized functions, for both memory usage and performance, ready to achieve any scripting operation with successful execution. A walkthrough on how to install them is provided next. Partially inspired by similar tools present in R and MATLAB environments, we will explore together how a few selected Python commands can allow you to efficiently handle data and then explore, transform, experiment with, and learn from it without having to write too much code or reinvent the wheel.

NumPy

NumPy, which is Travis Oliphant's creation, is the true analytical workhorse of the Python language. It provides the user with multidimensional arrays, along with a large set of functions to perform a multiplicity of mathematical operations on these arrays. Arrays are blocks of data arranged along multiple dimensions, which implement mathematical vectors and matrices.
Characterized by optimal memory allocation, arrays are useful not just for storing data, but also for fast matrix operations (vectorization), which are indispensable when you wish to solve ad hoc data science problems:

Website: www.numpy.org
Version at the time of print: 1.11.0
Suggested install command: pip install numpy

As a convention largely adopted by the Python community, when importing NumPy, it is suggested that you alias it as np:

import numpy as np

SciPy

An original project by Travis Oliphant, Pearu Peterson, and Eric Jones, SciPy completes NumPy's functionalities, offering a larger variety of scientific algorithms for linear algebra, sparse matrices, signal and image processing, optimization, fast Fourier transformation, and much more:

Website: www.scipy.org
Version at the time of print: 0.17.1
Suggested install command: pip install scipy

pandas

The pandas package deals with everything that NumPy and SciPy cannot do. Thanks to its specific data structures, namely DataFrames and Series, pandas allows you to handle complex tables of data of different types (which is something that NumPy's arrays cannot do) and time series. Thanks to Wes McKinney's creation, you will be able to easily and smoothly load data from a variety of sources. You can then slice, dice, handle missing elements, add, rename, aggregate, reshape, and finally visualize your data at will:

Website: pandas.pydata.org
Version at the time of print: 0.18.1
Suggested install command: pip install pandas

Conventionally, pandas is imported as pd:

import pandas as pd

Scikit-learn

Started as part of the SciKits (SciPy Toolkits), Scikit-learn is the core of data science operations in Python. It offers all that you may need in terms of data preprocessing, supervised and unsupervised learning, model selection, validation, and error metrics. Scikit-learn started in 2007 as a Google Summer of Code project by David Cournapeau. Since 2013, it has been taken over by the researchers at INRIA (the French Institute for Research in Computer Science and Automation):

Website: scikit-learn.org/stable
Version at the time of print: 0.17.1
Suggested install command: pip install scikit-learn

Note that the imported module is named sklearn.

Jupyter

A scientific approach requires the fast experimentation of different hypotheses in a reproducible fashion. Initially named IPython and limited to working only with the Python language, Jupyter was created by Fernando Perez in order to address the need for an interactive Python command shell (based on the shell, web browser, and application interface), with graphical integration, customizable commands, rich history (in the JSON format), and computational parallelism for enhanced performance. Jupyter is our favoured choice; it is used to clearly and effectively illustrate operations with scripts and data, and the consequent results:

Website: jupyter.org
Version at the time of print: 1.0.0 (ipykernel = 4.3.1)
Suggested install command: pip install jupyter
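Before moving on to the plotting and statistics libraries, here is a small sketch of ours (not from the book) that ties together the np and pd import conventions introduced above; the data is made up for illustration:

import numpy as np
import pandas as pd

# Vectorized arithmetic on a NumPy array: no explicit Python loop needed.
values = np.array([1.0, 2.0, 3.0, 4.0])
print(values.mean(), values * 2)

# A small pandas DataFrame mixing a text column with the numeric one.
frame = pd.DataFrame({"label": ["a", "b", "c", "d"], "value": values})
print(frame.describe())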
Matplotlib

Originally developed by John Hunter, matplotlib is the library that contains all the building blocks required to create quality plots from arrays and to visualize them interactively. You can find all the MATLAB-like plotting frameworks inside the pylab module:

Website: matplotlib.org
Version at the time of print: 1.5.1
Suggested install command: pip install matplotlib

You can simply import what you need for your visualization purposes with the following command:

import matplotlib.pyplot as plt

Statsmodels

Previously part of SciKits, statsmodels was conceived as a complement to SciPy's statistical functions. It features generalized linear models, discrete choice models, time series analysis, and a series of descriptive statistics as well as parametric and nonparametric tests:

Website: statsmodels.sourceforge.net
Version at the time of print: 0.6.1
Suggested install command: pip install statsmodels

Beautiful Soup

Beautiful Soup, a creation of Leonard Richardson, is a great tool for scraping data out of HTML and XML files retrieved from the Internet. It works incredibly well, even in the case of tag soups (hence the name), which are collections of malformed, contradictory, and incorrect tags. After choosing your parser (the HTML parser included in Python's standard library works fine), thanks to Beautiful Soup, you can navigate through the objects in the page and extract text, tables, and any other information that you may find useful:

Website: www.crummy.com/software/BeautifulSoup
Version at the time of print: 4.4.1
Suggested install command: pip install beautifulsoup4

Note that the imported module is named bs4.

NetworkX

Developed by the Los Alamos National Laboratory, NetworkX is a package specialized in the creation, manipulation, analysis, and graphical representation of real-life network data (it can easily operate with graphs made up of a million nodes and edges). Besides specialized data structures for graphs and fine visualization methods (2D and 3D), it provides the user with many standard graph measures and algorithms, such as the shortest path, centrality, components, communities, clustering, and PageRank:

Website: networkx.github.io
Version at the time of print: 1.11
Suggested install command: pip install networkx

Conventionally, NetworkX is imported as nx:

import networkx as nx

NLTK

The Natural Language Toolkit (NLTK) provides access to corpora and lexical resources and to a complete suite of functions for statistical Natural Language Processing (NLP), ranging from tokenizers to part-of-speech taggers and from tree models to named-entity recognition. Initially, Steven Bird and Edward Loper created the package as an NLP teaching infrastructure for their course at the University of Pennsylvania. Now, it is a fantastic tool that you can use to prototype and build NLP systems:

Website: www.nltk.org
Version at the time of print: 3.2.1
Suggested install command: pip install nltk

Gensim

Gensim, programmed by Radim Rehurek, is an open source package suitable for the analysis of large textual collections with the help of parallel, distributable, online algorithms. Among its advanced functionalities, it implements Latent Semantic Analysis (LSA), topic modelling by Latent Dirichlet Allocation (LDA), and Google's word2vec, a powerful algorithm that transforms text into vector features that can be used in supervised and unsupervised machine learning:

Website: radimrehurek.com/gensim
Version at the time of print: 0.12.4
Suggested install command: pip install gensim
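To make the NetworkX conventions above concrete, here is a tiny sketch of ours (not from the book) that builds a toy graph and queries two of the measures mentioned in its description:

import networkx as nx

# A small undirected graph with five edges.
graph = nx.Graph()
graph.add_edges_from([("A", "B"), ("B", "C"), ("C", "D"), ("A", "D"), ("B", "D")])

print(nx.shortest_path(graph, "A", "C"))  # one shortest path, e.g. ['A', 'B', 'C']
print(nx.degree_centrality(graph))        # degree centrality for each node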
PyPy

PyPy is not a package; it is an alternative implementation of Python 2.7.8 that supports most of the commonly used Python standard packages (unfortunately, NumPy is currently not fully supported). As an advantage, it offers enhanced speed and memory handling. Thus, it is very useful for heavy-duty operations on large chunks of data and it should be part of your big data handling strategies:

Website: pypy.org/
Version at the time of print: 5.1
Download page: pypy.org/download.html

XGBoost

XGBoost is a scalable, portable, and distributed gradient boosting library (a tree ensemble machine learning algorithm). Initially created by Tianqi Chen from the University of Washington, it has been enriched with a Python wrapper by Bing Xu and an R interface by Tong He (you can read the story behind XGBoost directly from its principal creator at homes.cs.washington.edu/~tqchen/2016/03/10/story-and-lessons-behind-the-evolution-of-xgboost.html). XGBoost is available for Python, R, Java, Scala, Julia, and C++, and it can work on a single machine (leveraging multithreading) as well as in Hadoop and Spark clusters:

Website: xgboost.readthedocs.io/en/latest
Version at the time of print: 0.4
Download page: github.com/dmlc/xgboost

Detailed instructions for installing XGBoost on your system can be found at this page: github.com/dmlc/xgboost/blob/master/doc/build.md

The installation of XGBoost on both Linux and MacOS is quite straightforward, whereas it is a little bit trickier for Windows users. For this reason, we provide specific installation steps to get XGBoost working on Windows:

First, download and install Git for Windows (git-for-windows.github.io). Then you need a MinGW compiler present on your system. You can download it from www.mingw.org according to the characteristics of your system. From the command line, execute:

$> git clone --recursive https://github.com/dmlc/xgboost
$> cd xgboost
$> git submodule init
$> git submodule update

Then, still from the command line, copy the configuration for 64-bit systems to be the default one:

$> copy make\mingw64.mk config.mk

Alternatively, you can copy the plain 32-bit version:

$> copy make\mingw.mk config.mk

After copying the configuration file, you can run the compiler, setting it to use four threads in order to speed up the compiling procedure:

$> mingw32-make -j4

In MinGW, the make command comes with the name mingw32-make. If you are using a different compiler, the previous command may not work; in that case, you can simply try:

$> make -j4

Finally, if the compiler completes its work without errors, you can install the package in your Python installation with this:

$> cd python-package
$> python setup.py install

After following all the preceding instructions, if you try to import XGBoost in Python and it doesn't load and results in an error, it may well be that Python cannot find the MinGW g++ runtime libraries. You just need to find the location of MinGW's binaries on your computer (in our case, it was in C:\mingw-w64\mingw64\bin; just modify the following code to use yours) and place the following code snippet before importing XGBoost:

import os
mingw_path = r'C:\mingw-w64\mingw64\bin'
os.environ['PATH'] = mingw_path + ';' + os.environ['PATH']
import xgboost as xgb

Depending on the state of the XGBoost project, as with many other projects under continuous development, the preceding installation commands may or may not temporarily work at the time you try them. Usually, waiting for an update of the project or opening an issue with the authors of the package may solve the problem.
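Once XGBoost imports cleanly, a minimal training run looks roughly like the following sketch of ours (not from the book); the data is randomly generated and the parameter values are arbitrary illustrations:

import numpy as np
import xgboost as xgb

# A tiny made-up binary classification problem.
X = np.random.rand(100, 4)
y = (X[:, 0] + X[:, 1] > 1.0).astype(int)

dtrain = xgb.DMatrix(X, label=y)
params = {'objective': 'binary:logistic', 'max_depth': 3, 'eta': 0.1}
booster = xgb.train(params, dtrain, num_boost_round=20)
predictions = booster.predict(dtrain)  # predicted probabilities for the training rows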
Theano

Theano is a Python library that allows you to define, optimize, and evaluate mathematical expressions involving multi-dimensional arrays efficiently. Basically, it provides you with all the building blocks you need to create deep neural networks. Created by academics (an entire development team; you can read their names in their most recent paper at arxiv.org/pdf/1605.02688.pdf), Theano has been used for large-scale and intensive computations since 2007:

Website: deeplearning.net/software/theano
Release at the time of print: 0.8.2

In spite of the many installation problems experienced by users in the past (especially Windows users), the installation of Theano should be straightforward, the package now being available on PyPI:

$> pip install Theano

If you want the most updated version of the package, you can get it by cloning from GitHub:

$> git clone git://github.com/Theano/Theano.git

Then you can proceed with a direct Python installation:

$> cd Theano
$> python setup.py install

To test your installation, you can run the following from the shell/CMD and verify the reports:

$> pip install nose
$> pip install nose-parameterized
$> nosetests theano

If you are working on a Windows OS and the previous instructions don't work, you can try these steps, using the conda command provided by the Anaconda distribution:

Install TDM GCC x64 (this can be found at tdm-gcc.tdragon.net), then open an Anaconda prompt interface and execute:

$> conda update conda
$> conda update --all
$> conda install mingw libpython
$> pip install git+git://github.com/Theano/Theano.git

Theano needs libpython, which isn't yet compatible with version 3.5. So if your Windows installation is not working, this could be the likely cause. Anyway, Theano installs perfectly on Python version 3.4. Our suggestion in this case is to create a virtual Python environment based on version 3.4, and install and use Theano only on that specific version. Directions on how to create virtual environments are provided in the sections about virtualenv and conda create. In addition, Theano's website provides some information for Windows users; it could support you when everything else fails: deeplearning.net/software/theano/install_windows.html

An important requirement for Theano to scale out on GPUs is to install the NVIDIA CUDA drivers and SDK for code generation and execution on the GPU. If you do not know too much about the CUDA Toolkit, you can start from this web page in order to understand more about the technology being used: developer.nvidia.com/cuda-toolkit

Therefore, if your computer has an NVIDIA GPU, you can find all the necessary instructions in order to install CUDA using this tutorial page from NVIDIA itself: docs.nvidia.com/cuda/cuda-quick-start-guide/index.html

Keras

Keras is a minimalist and highly modular neural networks library, written in Python and capable of running on top of either Theano or TensorFlow (the open source software library for numerical computation released by Google). Keras was created by François Chollet, a machine learning researcher working at Google:

Website: keras.io
Version at the time of print: 1.0.3
Suggested installation from PyPI: $> pip install keras

As an alternative, you can install the latest available version (which is advisable, since the package is in continuous development) using the command:

$> pip install git+git://github.com/fchollet/keras.git

Summary

In this article, we performed a lot of installations, from Python packages to examples. They were installed either directly or by using a scientific distribution. We also introduced Jupyter notebooks and demonstrated how you can have access to the data run in the tutorials.


The Data Science Venn Diagram

Packt
21 Oct 2016
15 min read
It is a common misconception that only those with a PhD or geniuses can understand the math/programming behind data science. This is absolutely false. In this article by Sinan Ozdemir, author of the book Principles of Data Science, we will discuss how data science begins with three basic areas:

- Math/statistics: This is the use of equations and formulas to perform analysis
- Computer programming: This is the ability to use code to create outcomes on the computer
- Domain knowledge: This refers to understanding the problem domain (medicine, finance, social science, and so on)

The following Venn diagram provides a visual representation of how the three areas of data science intersect:

[Figure: The Venn diagram of data science]

Those with hacking skills can conceptualize and program complicated algorithms using computer languages. Having a math and statistics knowledge base allows you to theorize and evaluate algorithms and tweak the existing procedures to fit specific situations. Having substantive (domain) expertise allows you to apply concepts and results in a meaningful and effective way.

While having only two of these three qualities can make you intelligent, it will also leave a gap. Consider that you are very skilled in coding and have formal training in day trading. You might create an automated system to trade in your place but lack the math skills to evaluate your algorithms and, therefore, end up losing money in the long run. It is only when you can boast skills in coding, math, and domain knowledge that you can truly perform data science.

The one that was probably a surprise for you was domain knowledge. It is really just knowledge of the area you are working in. If a financial analyst started analyzing data about heart attacks, they might need the help of a cardiologist to make sense of a lot of the numbers.

Data science is the intersection of the three key areas mentioned earlier. In order to gain knowledge from data, we must be able to utilize computer programming to access the data, understand the mathematics behind the models we derive, and above all, understand our analyses' place in the domain we are in. This includes the presentation of data. If we are creating a model to predict heart attacks in patients, is it better to create a PDF of information or an app where you can type in numbers and get a quick prediction? All these decisions must be made by the data scientist.

Also, note that the intersection of math and coding is machine learning, but it is important to note that without the explicit ability to generalize any models or results to a domain, machine learning algorithms remain just algorithms sitting on your computer. You might have the best algorithm to predict cancer. You could be able to predict cancer with over 99% accuracy based on past cancer patient data, but if you don't understand how to apply this model in a practical sense such that doctors and nurses can easily use it, your model might be useless. Domain knowledge comes with both the practice of data science and reading examples of other people's analyses.

The math

Most people stop listening once someone says the word "math". They'll nod along in an attempt to hide their utter disdain for the topic. We will use subdomains of mathematics to create what are called models. A data model refers to an organized and formal relationship between elements of data, usually meant to simulate a real-world phenomenon.
Essentially, we will use math in order to formalize relationships between variables. As a former pure mathematician and current math teacher, I know how difficult this can be. I will do my best to explain everything as clearly as I can. Between the three areas of data science, math is what allows us to move from domain to domain. Understanding the theory allows us to apply a model that we built for the fashion industry to a financial model. Every mathematical concept I introduce, I do so with care, examples, and purpose. The math in this article is essential for data scientists.

Example – Spawner-Recruit Models

In biology, we use, among many others, a model known as the Spawner-Recruit model to judge the biological health of a species. It is a basic relationship between the number of healthy parental units of a species and the number of new units in the group of animals. In a public dataset of the number of salmon spawners and recruits, the following graph was formed to visualize the relationship between the two. We can see that there definitely is some sort of positive relationship (as one goes up, so does the other). But how can we formalize this relationship? For example, if we knew the number of spawners in a population, could we predict the number of recruits that the group would obtain, and vice versa? Essentially, models allow us to plug in one variable to get the other.

In this example, let's say we knew that a group of salmon had 1.15 (in thousands) of spawners. Plugging that number into the fitted model gives us a predicted number of recruits. This result can be very beneficial to estimate how the health of a population is changing. If we can create these models, we can visually observe how the relationship between the two variables can change.

There are many types of data models, including probabilistic and statistical models. Both of these are subsets of a larger paradigm, called machine learning. The essential idea behind these three topics is that we use data in order to come up with the "best" model possible. We no longer rely on human instincts; rather, we rely on data.

[Figure: Spawner-Recruit model visualized]

The purpose of this example is to show how we can define relationships between data elements using mathematical equations. The fact that I used salmon health data was irrelevant! The main reason for this is that I would like you (the reader) to be exposed to as many domains as possible. Math and coding are vehicles that allow data scientists to step back and apply their skills virtually anywhere.

Computer programming

Let's be honest. You probably think computer science is way cooler than math. That's ok, I don't blame you. The news isn't filled with math news like it is with news on the technological front. You don't turn on the TV to see a new theory on primes; rather, you will see investigative reports on how the latest smartphone can take photos of cats better or something. Computer languages are how we communicate with the machine and tell it to do our bidding. A computer speaks many languages and, like a book, can be written in many languages; similarly, data science can also be done in many languages. Python, Julia, and R are some of the many languages available to us. This article will focus exclusively on using Python.
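Looking back at the Spawner-Recruit example, here is a minimal sketch of ours (not from the book) of how such a relationship could be formalized in Python; the numbers and the straight-line form are made-up assumptions, not the book's dataset or model:

import numpy as np

# Made-up spawner and recruit counts, in thousands.
spawners = np.array([0.5, 0.9, 1.2, 1.6, 2.0, 2.4])
recruits = np.array([1.1, 1.8, 2.6, 3.1, 4.0, 4.7])

# Fit a straight line: recruits is roughly slope * spawners + intercept.
slope, intercept = np.polyfit(spawners, recruits, 1)

# Predicted recruits for a population with 1.15 (thousands) of spawners.
print(slope * 1.15 + intercept)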
Why Python?

We will use Python for a variety of reasons:

- Python is an extremely simple language to read and write, even if you've never coded before, which will make future examples easy to ingest and read later.
- It is one of the most common languages in production and in the academic setting (one of the fastest growing, as a matter of fact).
- The online community of the language is vast and friendly. This means that a quick Google search should yield multiple results of people who have faced and solved similar (if not exact) situations.
- Python has prebuilt data science modules that both the novice and the veteran data scientist can utilize.

The last is probably the biggest reason we will focus on Python. These prebuilt modules are not only powerful but also easy to pick up. Some of these modules are as follows:

- pandas
- sci-kit learn
- seaborn
- numpy/scipy
- requests (to mine data from the web)
- BeautifulSoup (for web HTML parsing)

Python practices

Before we move on, it is important to formalize many of the requisite coding skills in Python. In Python, we have variables that are placeholders for objects. We will focus on only a few types of basic objects at first:

- int (an integer). Examples: 3, 6, 99, -34, 34, 11111111
- float (a decimal). Examples: 3.14159, 2.71, -0.34567
- boolean (either true or false). The statement "Sunday is a weekend" is true; the statement "Friday is a weekend" is false; the statement "pi is exactly the ratio of a circle's circumference to its diameter" is true (crazy, right?)
- string (text or words made up of characters). "I love hamburgers" (by the way, who doesn't?), "Matt is awesome"; a tweet is a string
- list (a collection of objects). Example: [1, 5.4, True, "apple"]

We will also have to understand some basic logical operators. For these operators, keep the boolean type in mind. Every operator will evaluate to either true or false:

- == evaluates to true if both sides are equal; otherwise it evaluates to false
  3 + 4 == 7 (will evaluate to true)
  3 - 2 == 7 (will evaluate to false)
- < (less than)
  3 < 5 (true)
  5 < 3 (false)
- <= (less than or equal to)
  3 <= 3 (true)
  5 <= 3 (false)
- > (greater than)
  3 > 5 (false)
  5 > 3 (true)
- >= (greater than or equal to)
  3 >= 3 (true)
  5 >= 3 (false)

When coding in Python, I will use a pound sign (#) to create a comment, which will not be processed as code but is merely there to communicate with the reader. Anything to the right of a # is a comment on the code being executed.

Example of basic Python

In Python, we use spaces/tabs to denote operations that belong to other lines of code. Note the use of the if statement. It means exactly what you think it means. When the statement after the if statement is true, then the tabbed part under it will be executed, as shown in the following code:

x = 5.8
y = 9.5

x + y == 15.3 # This is True!
x - y == 15.3 # This is False!

if x + y == 15.3: # If the statement is true:
    print "True!" # print something!

The print "True!" belongs to the if x + y == 15.3: line preceding it because it is tabbed right under it. This means that the print statement will be executed if and only if x + y equals 15.3.

Note that the following list variable, my_list, can hold multiple types of objects. This one has an int, a float, a boolean, and a string (in that order):

my_list = [1, 5.7, True, "apples"]

len(my_list) == 4 # 4 objects in the list

my_list[0] == 1 # the first object
my_list[1] == 5.7 # the second object

In the preceding code, I used the len command to get the length of the list (which was four). Note the zero-indexing of Python. Most computer languages start counting at zero instead of one. So if I want the first element, I call index zero, and if I want the 95th element, I call index 94.
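Building on the zero-indexing note above, here are a few more indexing forms that Python allows; these lines are our own illustrative additions, not from the book:

my_list = [1, 5.7, True, "apples"]

my_list[0]    # 1, the first object (index zero)
my_list[3]    # "apples", the fourth and last object
my_list[-1]   # "apples" again: negative indexes count from the end of the list
my_list[1:3]  # [5.7, True]: a slice from index 1 up to, but not including, index 3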
Example – parsing a single Tweet

Here is some more Python code. In this example, I will be parsing some tweets about stock prices:

tweet = "RT @j_o_n_dnger: $TWTR now top holding for Andor, unseating $AAPL"

words_in_tweet = tweet.split(' ') # list of words in tweet

for word in words_in_tweet:               # for each word in list
    if "$" in word:                       # if word has a "cashtag"
        print "THIS TWEET IS ABOUT", word # alert the user

I will point out a few things about this code snippet, line by line, as follows:

- We set a variable to hold some text (known as a string in Python). In this example, the tweet in question is "RT @j_o_n_dnger: $TWTR now top holding for Andor, unseating $AAPL".
- The words_in_tweet variable "tokenizes" the tweet (separates it by word). If you were to print this variable, you would see the following:
  ['RT', '@j_o_n_dnger:', '$TWTR', 'now', 'top', 'holding', 'for', 'Andor,', 'unseating', '$AAPL']
- We iterate through this list of words. This is called a for loop. It just means that we go through a list one by one.
- Here, we have another if statement. For each word in this tweet, we check whether the word contains the $ character (this is how people reference stock tickers on Twitter).
- If the preceding if statement is true (that is, if the word contains a cashtag), we print it and show it to the user.

The output of this code will be as follows:

THIS TWEET IS ABOUT $TWTR
THIS TWEET IS ABOUT $AAPL

We get this output as these are the only words in the tweet that use the cashtag. Whenever I use Python in this article, I will ensure that I am as explicit as possible about what I am doing in each line of code.

Domain knowledge

As I mentioned earlier, this category focuses mainly on having knowledge about the particular topic you are working on. For example, if you are a financial analyst working on stock market data, you have a lot of domain knowledge. If you are a journalist looking at worldwide adoption rates, you might benefit from consulting an expert in the field. Does that mean that if you're not a doctor, you can't work with medical data? Of course not! Great data scientists can apply their skills to any area, even if they aren't fluent in it. Data scientists can adapt to the field and contribute meaningfully when their analysis is complete.

A big part of domain knowledge is presentation. Depending on your audience, it can greatly matter how you present your findings. Your results are only as good as your vehicle of communication. You can predict the movement of the market with 99.99% accuracy, but if your program is impossible to execute, your results will go unused. Likewise, if your vehicle is inappropriate for the field, your results will go equally unused.

Some more terminology

This is a good time to define some more vocabulary. By this point, you're probably excitedly looking up a lot of data science material and seeing words and phrases I haven't used yet. Here are some common terminologies you are likely to come across:

- Machine learning: This refers to giving computers the ability to learn from data without explicit "rules" being given by a programmer. Machine learning combines the power of computers with intelligent learning algorithms in order to automate the discovery of relationships in data and the creation of powerful data models.
Speaking of data models, we will concern ourselves with the following two basic types of data models:
Probabilistic model: This refers to using probability to find a relationship between elements that includes a degree of randomness
Statistical model: This refers to taking advantage of statistical theorems to formalize relationships between data elements in a (usually) simple mathematical formula
While both the statistical and probabilistic models can be run on computers and might be considered machine learning in that regard, we will keep these definitions separate as machine learning algorithms generally attempt to learn relationships in different ways.
Exploratory data analysis: This refers to preparing data in order to standardize results and gain quick insights. Exploratory data analysis (EDA) is concerned with data visualization and preparation. This is where we turn unorganized data into organized data and also clean up missing/incorrect data points. During EDA, we will create many types of plots and use these plots in order to identify key features and relationships to exploit in our data models.
Data mining: This is the process of finding relationships between elements of data. Data mining is the part of data science where we try to find relationships between variables (think spawn-recruit model).
I tried pretty hard not to use the term big data up until now, because I think this term is misused a lot. While the definition of the term varies from person to person, big data is data that is too large to be processed by a single machine (if your laptop crashed, it might be suffering from a case of big data).
The following diagram shows the state of data science so far (the diagram is incomplete and is meant for visualization purposes only).
Summary
More and more people are jumping headfirst into the field of data science, most with no prior experience in math or CS, which on the surface is great. Average data scientists have access to millions of dating profiles' data, tweets, online reviews, and much more in order to jumpstart their education. However, if you jump into data science without the proper exposure to theory or coding practices and without respect for the domain you are working in, you face the risk of oversimplifying the very phenomenon you are trying to model. Resources for Article: Further resources on this subject: Reconstructing 3D Scenes [article] Basics of Classes and Objects [article] Saying Hello! [article]

Jupyter and Python Scripting

Packt
21 Oct 2016
9 min read
In this article by Dan Toomey, author of the book Learning Jupyter, we will see data access in Jupyter with Python and the effect of pandas on Jupyter. We will also see Python graphics and lastly Python random numbers. (For more resources related to this topic, see here.) Python data access in Jupyter I started a view for pandas using Python Data Access as the name. We will read in a large dataset and compute some standard statistics on the data. We are interested in seeing how we use pandas in Jupyter, how well the script performs, and what information is stored in the metadata (especially if it is a larger dataset). Our script accesses the iris dataset built into one of the Python packages. All we are looking to do is read in a slightly large number of items and calculate some basic operations on the dataset. We are really interested in seeing how much of the data is cached in the PYNB file. The Python code is: # import the datasets package from sklearn import datasets # pull in the iris data iris_dataset = datasets.load_iris() # grab the first two columns of data X = iris_dataset.data[:, :2] # calculate some basic statistics x_count = len(X.flat) x_min = X[:, 0].min() - .5 x_max = X[:, 0].max() + .5 x_mean = X[:, 0].mean() # display our results x_count, x_min, x_max, x_mean I broke these steps into a couple of cells in Jupyter, as shown in the following screenshot: Now, run the cells (using Cell | Run All) and you get this display below. The only difference is the last Out line where our values are displayed. It seemed to take longer to load the library (the first time I ran the script) than to read the data and calculate the statistics. If we look in the PYNB file for this notebook, we see that none of the data is cached in the PYNB file. We simply have code references to the library, our code, and the output from when we last calculated the script: { "cell_type": "code", "execution_count": 4, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "(300, 3.7999999999999998, 8.4000000000000004, 5.8433333333333337)" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# calculate some basic statisticsn", "x_count = len(X.flat)n", "x_min = X[:, 0].min() - .5n", "x_max = X[:, 0].max() + .5n", "x_mean = X[:, 0].mean()n", "n", "# display our resultsn", "x_count, x_min, x_max, x_mean" ] } Python pandas in Jupyter One of the most widely used features of Python is pandas. pandas are built-in libraries of data analysis packages that can be used freely. In this example, we will develop a Python script that uses pandas to see if there is any effect to using them in Jupyter. I am using the Titanic dataset from http://www.kaggle.com/c/titanic-gettingStarted/download/train.csv. I am sure the same data is available from a variety of sources. Here is our Python script that we want to run in Jupyter: from pandas import * training_set = read_csv('train.csv') training_set.head() male = training_set[training_set.sex == 'male'] female = training_set[training_set.sex =='female'] womens_survival_rate = float(sum(female.survived))/len(female) mens_survival_rate = float(sum(male.survived))/len(male) The result is… we calculate the survival rates of the passengers based on sex. We create a new notebook, enter the script into appropriate cells, include adding displays of calculated data at each point and produce our results. 
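As an aside that is not part of the original notebook, the same survival rates can be computed in a single pandas expression; this sketch assumes the capitalized column names Sex and Survived that the Kaggle file uses (a detail that comes up again below):
from pandas import read_csv

training_set = read_csv('train.csv')
# group the 0/1 Survived flag by the Sex column and take the mean per group
survival_rates = training_set.groupby('Sex')['Survived'].mean()
print(survival_rates)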
Here is our notebook laid out where we added displays of calculated data at each cell,as shown in the following screenshot: When I ran this script, I had two problems: On Windows, it is common to use backslash ("") to separate parts of a filename. However, this coding uses the backslash as a special character. So, I had to change over to use forward slash ("/") in my CSV file path. I originally had a full path to the CSV in the above code example. The dataset column names are taken directly from the file and are case sensitive. In this case, I was originally using the 'sex' field in my script, but in the CSV file the column is named Sex. Similarly I had to change survived to Survived. The final script and result looks like the following screenshot when we run it: I have used the head() function to display the first few lines of the dataset. It is interesting… the amount of detail that is available for all of the passengers. If you scroll down, you see the results as shown in the following screenshot: We see that 74% of the survivors were women versus just 19% men. I would like to think chivalry is not dead! Curiously the results do not total to 100%. However, like every other dataset I have seen, there is missing and/or inaccurate data present. Python graphics in Jupyter How do Python graphics work in Jupyter? I started another view for this named Python Graphics so as to distinguish the work. If we were to build a sample dataset of baby names and the number of births in a year of that name, we could then plot the data. The Python coding is simple: import pandas import matplotlib %matplotlib inline baby_name = ['Alice','Charles','Diane','Edward'] number_births = [96, 155, 66, 272] dataset = list(zip(baby_name,number_births)) df = pandas.DataFrame(data = dataset, columns=['Name', 'Number']) df['Number'].plot() The steps of the script are as follows: We import the graphics library (and data library) that we need Define our data Convert the data into a format that allows for easy graphical display Plot the data We would expect a resultant graph of the number of births by baby name. Taking the above script and placing it into cells of our Jupyter node, we get something that looks like the following screenshot: I have broken the script into different cells for easier readability. Having different cells also allows you to develop the script easily step by step, where you can display the values computed so far to validate your results. I have done this in most of the cells by displaying the dataset and DataFrame at the bottom of those cells. When we run this script (Cell | Run All), we see the results at each step displayed as the script progresses: And finally we see our plot of the births as shown in the following screenshot. I was curious what metadata was stored for this script. Looking into the IPYNB file, you can see the expected value for the formula cells. 
The tabular data display of the DataFrame is stored as HTML—convenient: { "cell_type": "code", "execution_count": 43, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "<div>n", "<table border="1" class="dataframe">n", "<thead>n", "<tr style="text-align: right;">n", "<th></th>n", "<th>Name</th>n", "<th>Number</th>n", "</tr>n", "</thead>n", "<tbody>n", "<tr>n", "<th>0</th>n", "<td>Alice</td>n", "<td>96</td>n", "</tr>n", "<tr>n", "<th>1</th>n", "<td>Charles</td>n", "<td>155</td>n", "</tr>n", "<tr>n", "<th>2</th>n", "<td>Diane</td>n", "<td>66</td>n", "</tr>n", "<tr>n", "<th>3</th>n", "<td>Edward</td>n", "<td>272</td>n", "</tr>n", "</tbody>n", "</table>n", "</div>" ], "text/plain": [ " Name Numbern", "0 Alice 96n", "1 Charles 155n", "2 Diane 66n", "3 Edward 272" ] }, "execution_count": 43, "metadata": {}, "output_type": "execute_result" } ], The graphic output cell that is stored like this: { "cell_type": "code", "execution_count": 27, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "<matplotlib.axes._subplots.AxesSubplot at 0x47cf8f0>" ] }, "execution_count": 27, "metadata": {}, "output_type": "execute_result" }, { "data": { "image/png": "<a few hundred lines of hexcodes> …/wc/B0RRYEH0EQAAAABJRU5ErkJggg==n", "text/plain": [ "<matplotlib.figure.Figure at 0x47d8e30>" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "# plot the datan", "df['Number'].plot()n" ] } ], Where the image/png tag contains a large hex digit string representation of the graphical image displayed on screen (I abbreviated the display in the coding shown). So, the actual generated image is stored in the metadata for the page. Python random numbers in Jupyter For many analyses we are interested in calculating repeatable results. However, much of the analysis relies on some random numbers to be used. In Python, you can set the seed for the random number generator to achieve repeatable results with the random_seed() function. In this example, we simulate rolling a pair of dice and looking at the outcome. We would example the average total of the two dice to be 6—the halfway point between the faces. The script we are using is this: import pylab import random random.seed(113) samples = 1000 dice = [] for i in range(samples): total = random.randint(1,6) + random.randint(1,6) dice.append(total) pylab.hist(dice, bins= pylab.arange(1.5,12.6,1.0)) pylab.show() Once we have the script in Jupyter and execute it, we have this result: I had added some more statistics. Not sure if I would have counted on such a high standard deviation. If we increased the number of samples, this would decrease. The resulting graph was opened in a new window, much as it would if you ran this script in another Python development environment. The toolbar at the top of the graphic is extensive, allowing you to manipulate the graphic in many ways. Summary In this article, we walked through simple data access in Jupyter through Python. Then we saw an example of using pandas. We looked at a graphics example. Finally, we looked at an example using random numbers in a Python script. Resources for Article: Further resources on this subject: Python Data Science Up and Running [article] Mining Twitter with Python – Influence and Engagement [article] Unsupervised Learning [article]

Heart Diseases Prediction using Spark 2.0.0

Packt
18 Oct 2016
16 min read
In this article, Md. Rezaul Karim and Md. Mahedi Kaysar, the authors of the book Large Scale Machine Learning with Spark discusses how to develop a large scale heart diseases prediction pipeline by considering steps like taking input, parsing, making label point for regression, model training, model saving and finally predictive analytics using the trained model using Spark 2.0.0. In this article, they will develop a large-scale machine learning application using several classifiers like the random forest, decision tree, and linear regression classifier. To make this happen the following steps will be covered: Data collection and exploration Loading required packages and APIs Creating an active Spark session Data parsing and RDD of Label point creation Splitting the RDD of label point into training and test set Training the model Model saving for future use Predictive analysis using the test set Predictive analytics using the new dataset Performance comparison among different classifier (For more resources related to this topic, see here.) Background Machine learning in big data together is a radical combination that has created some great impacts in the field of research to academia and industry as well in the biomedical sector. In the area of biomedical data analytics, this carries a better impact on a real dataset for diagnosis and prognosis for better healthcare. Moreover, the life science research is also entering into the Big data since datasets are being generated and produced in an unprecedented way. This imposes great challenges to the machine learning and bioinformatics tools and algorithms to find the VALUE out of the big data criteria like volume, velocity, variety, veracity, visibility and value. In this article, we will show how to predict the possibility of future heart disease by using Spark machine learning APIs including Spark MLlib, Spark ML, and Spark SQL. Data collection and exploration In the recent time, biomedical research has gained lots of advancement and more and more life sciences data set are being generated making many of them open. However, for the simplicity and ease, we decided to use the Cleveland database. Because till date most of the researchers who have applied the machine learning technique to biomedical data analytics have used this dataset. According to the dataset description at https://archive.ics.uci.edu/ml/machine-learning-databases/heart-disease/heart-disease.names, the heart disease dataset is one of the most used as well as very well-studied datasets by the researchers from the biomedical data analytics and machine learning respectively. The dataset is freely available at the UCI machine learning dataset repository at https://archive.ics.uci.edu/ml/machine-learning-databases/heart-disease/. This data contains total 76 attributes, however, most of the published research papers refer to use a subset of only 14 feature of the field. The goal field is used to refer if the heart diseases are present or absence. It has 5 possible values ranging from 0 to 4. The value 0 signifies no presence of heart diseases. The value 1 and 2 signify that the disease is present but in the primary stage. The value 3 and 4, on the other hand, indicate the strong possibility of the heart disease. Biomedical laboratory experiments with the Cleveland dataset have determined on simply attempting to distinguish presence (values 1, 2, 3, 4) from absence (value 0). In short, the more the value the more possibility and evidence of the presence of the disease. 
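Before building the Spark pipeline, it can help to eyeball the label distribution. The following is only a hedged side sketch in Python/pandas (the article's pipeline itself is written in Java); it assumes the raw file has no header row, that ? marks missing values as the parsing code later suggests, and it uses the 14 attribute names from the table that follows:
import pandas as pd

columns = ["age", "sex", "cp", "trestbps", "chol", "fbs", "restecg",
           "thalach", "exang", "oldpeak", "slope", "ca", "thal", "num"]
heart = pd.read_csv("heart_diseases/processed_cleveland.data",
                    header=None, names=columns, na_values="?")

# num holds 0-4; anything above 0 is treated as evidence of heart disease
print(heart["num"].value_counts().sort_index())   # should roughly match 164/55/36/35/13
print("presence rate:", (heart["num"] > 0).mean())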
Another thing is that the privacy is an important concern in the area of biomedical data analytics as well as all kind of diagnosis and prognosis. Therefore, the names and social security numbers of the patients were recently removed from the dataset to avoid the privacy issue. Consequently, those values have been replaced with dummy values instead. It is to be noted that three files have been processed, containing the Cleveland, Hungarian, and Switzerland datasets altogether. All four unprocessed files also exist in this directory. To demonstrate the example, we will use the Cleveland dataset for training evaluating the models. However, the Hungarian dataset will be used to re-use the saved model. As said already that although the number of attributes is 76 (including the predicted attribute). However, like other ML/Biomedical researchers, we will also use only 14 attributes with the following attribute information:  No. Attribute name Explanation 1 age Age in years 2 sex Either male or female: sex (1 = male; 0 = female) 3 cp Chest pain type: -- Value 1: typical angina -- Value 2: atypical angina -- Value 3: non-angina pain -- Value 4: asymptomatic 4 trestbps Resting blood pressure (in mm Hg on admission to the hospital) 5 chol Serum cholesterol in mg/dl 6 fbs Fasting blood sugar. If > 120 mg/dl)(1 = true; 0 = false) 7 restecg Resting electrocardiographic results: -- Value 0: normal -- Value 1: having ST-T wave abnormality -- Value 2: showing probable or definite left ventricular hypertrophy by Estes' criteria. 8 thalach Maximum heart rate achieved 9 exang Exercise induced angina (1 = yes; 0 = no) 10 oldpeak ST depression induced by exercise relative to rest 11 slope The slope of the peak exercise ST segment    -- Value 1: upsloping    -- Value 2: flat    -- Value 3: down-sloping 12 ca Number of major vessels (0-3) coloured by fluoroscopy 13 thal Heart rate: ---Value 3 = normal; ---Value 6 = fixed defect ---Value 7 = reversible defect 14 num Diagnosis of heart disease (angiographic disease status) -- Value 0: < 50% diameter narrowing -- Value 1: > 50% diameter narrowing Table 1: Dataset characteristics Note there are several missing attribute values distinguished with value -9.0. In the Cleveland dataset contains the following class distribution: Database:     0       1     2     3   4   Total Cleveland:   164   55   36   35 13   303 A sample snapshot of the dataset is given as follows: Figure 1: Snapshot of the Cleveland's heart diseases dataset Loading required packages and APIs The following packages and APIs need to be imported for our purpose. 
We believe the packages are self-explanatory if you have minimum working experience with Spark 2.0.0.: import java.util.HashMap; import java.util.List; import org.apache.spark.api.java.JavaPairRDD; import org.apache.spark.api.java.JavaRDD; import org.apache.spark.api.java.function.Function; import org.apache.spark.api.java.function.PairFunction; import org.apache.spark.ml.classification.LogisticRegression; import org.apache.spark.mllib.classification.LogisticRegressionModel; import org.apache.spark.mllib.classification.NaiveBayes; import org.apache.spark.mllib.classification.NaiveBayesModel; import org.apache.spark.mllib.linalg.DenseVector; import org.apache.spark.mllib.linalg.Vector; import org.apache.spark.mllib.regression.LabeledPoint; import org.apache.spark.mllib.regression.LinearRegressionModel; import org.apache.spark.mllib.regression.LinearRegressionWithSGD; import org.apache.spark.mllib.tree.DecisionTree; import org.apache.spark.mllib.tree.RandomForest; import org.apache.spark.mllib.tree.model.DecisionTreeModel; import org.apache.spark.mllib.tree.model.RandomForestModel; import org.apache.spark.rdd.RDD; import org.apache.spark.sql.Dataset; import org.apache.spark.sql.Row; import org.apache.spark.sql.SparkSession; import com.example.SparkSession.UtilityForSparkSession; import javassist.bytecode.Descriptor.Iterator; import scala.Tuple2; Creating an active Spark session SparkSession spark = UtilityForSparkSession.mySession(); Here is the UtilityForSparkSession class that creates and returns an active Spark session: import org.apache.spark.sql.SparkSession; public class UtilityForSparkSession { public static SparkSession mySession() { SparkSession spark = SparkSession .builder() .appName("UtilityForSparkSession") .master("local[*]") .config("spark.sql.warehouse.dir", "E:/Exp/") .getOrCreate(); return spark; } } Note that here in Windows 7 platform, we have set the Spark SQL warehouse as "E:/Exp/", set your path accordingly based on your operating system. Data parsing and RDD of Label point creation Taken input as simple text file, parse them as text file and create RDD of label point that will be used for the classification and regression analysis. Also specify the input source and number of partition. Adjust the number of partition based on your dataset size. Here number of partition has been set to 2: String input = "heart_diseases/processed_cleveland.data"; Dataset<Row> my_data = spark.read().format("com.databricks.spark.csv").load(input); my_data.show(false); RDD<String> linesRDD = spark.sparkContext().textFile(input, 2); Since, JavaRDD cannot be created directly from the text files; rather we have created the simple RDDs, so that we can convert them as JavaRDD when necessary. Now let's create the JavaRDD with Label Point. 
However, we need to convert the RDD to JavaRDD to serve our purpose that goes as follows: JavaRDD<LabeledPoint> data = linesRDD.toJavaRDD().map(new Function<String, LabeledPoint>() { @Override public LabeledPoint call(String row) throws Exception { String line = row.replaceAll("\?", "999999.0"); String[] tokens = line.split(","); Integer last = Integer.parseInt(tokens[13]); double[] features = new double[13]; for (int i = 0; i < 13; i++) { features[i] = Double.parseDouble(tokens[i]); } Vector v = new DenseVector(features); Double value = 0.0; if (last.intValue() > 0) value = 1.0; LabeledPoint lp = new LabeledPoint(value, v); return lp; } }); Using the replaceAll() method we have handled the invalid values like missing values that are specified in the original file using ? character. To get rid of the missing or invalid value we have replaced them with a very large value that has no side-effect to the original classification or predictive results. The reason behind this is that missing or sparse data can lead you to highly misleading results. Splitting the RDD of label point into training and test set Well, in the previous step, we have created the RDD label point data that can be used for the regression or classification task. Now we need to split the data as training and test set. That goes as follows: double[] weights = {0.7, 0.3}; long split_seed = 12345L; JavaRDD<LabeledPoint>[] split = data.randomSplit(weights, split_seed); JavaRDD<LabeledPoint> training = split[0]; JavaRDD<LabeledPoint> test = split[1]; If you see the preceding code segments, you will find that we have split the RDD label point as 70% as the training and 30% goes to the test set. The randomSplit() method does this split. Note that, set this RDD's storage level to persist its values across operations after the first time it is computed. This can only be used to assign a new storage level if the RDD does not have a storage level set yet. The split seed value is a long integer that signifies that split would be random but the result would not be a change in each run or iteration during the model building or training. Training the model and predict the heart diseases possibility At the first place, we will train the linear regression model which is the simplest regression classifier. final double stepSize = 0.0000000009; final int numberOfIterations = 40; LinearRegressionModel model = LinearRegressionWithSGD.train(JavaRDD.toRDD(training), numberOfIterations, stepSize); As you can see the preceding code trains a linear regression model with no regularization using Stochastic Gradient Descent. This solves the least squares regression formulation f (weights) = 1/n ||A weights-y||^2^; which is the mean squared error. Here the data matrix has n rows, and the input RDD holds the set of rows of A, each with its corresponding right-hand side label y. Also to train the model it takes the training set, number of iteration and the step size. We provide here some random value of the last two parameters. Model saving for future use Now let's save the model that we just created above for future use. 
It's pretty simple just use the following code by specifying the storage location as follows: String model_storage_loc = "models/heartModel"; model.save(spark.sparkContext(), model_storage_loc); Once the model is saved in your desired location, you will see the following output in your Eclipse console: Figure 2: The log after model saved to the storage Predictive analysis using the test set Now let's calculate the prediction score on the test dataset: JavaPairRDD<Double, Double> predictionAndLabel = test.mapToPair(new PairFunction<LabeledPoint, Double, Double>() { @Override public Tuple2<Double, Double> call(LabeledPoint p) { return new Tuple2<>(model.predict(p.features()), p.label()); } }); Predict the accuracy of the prediction: double accuracy = predictionAndLabel.filter(new Function<Tuple2<Double, Double>, Boolean>() { @Override public Boolean call(Tuple2<Double, Double> pl) { return pl._1().equals(pl._2()); } }).count() / (double) test.count(); System.out.println("Accuracy of the classification: "+accuracy); The output goes as follows: Accuracy of the classification: 0.0 Performance comparison among different classifier Unfortunately, there is no prediction accuracy at all, right? There might be several reasons for that, including: The dataset characteristic Model selection Parameters selection, that is, also called hyperparameter tuning Well, for the simplicity, we assume the dataset is okay; since, as already said that it is a widely used dataset used for machine learning research used by many researchers around the globe. Now, what next? Let's consider another classifier algorithm for example Random forest or decision tree classifier. What about the Random forest? Lets' go for the random forest classifier at second place. Just use below code to train the model using the training set. Integer numClasses = 26; //Number of classes //HashMap is used to restrict the delicacy in the tree construction HashMap<Integer, Integer> categoricalFeaturesInfo = new HashMap<Integer, Integer>(); Integer numTrees = 5; // Use more in practice. String featureSubsetStrategy = "auto"; // Let the algorithm choose the best String impurity = "gini"; // also information gain & variance reduction available Integer maxDepth = 20; // set the value of maximum depth accordingly Integer maxBins = 40; // set the value of bin accordingly Integer seed = 12345; //Setting a long seed value is recommended final RandomForestModel model = RandomForest.trainClassifier(training, numClasses,categoricalFeaturesInfo, numTrees, featureSubsetStrategy, impurity, maxDepth, maxBins, seed); We believe the parameters user by the trainClassifier() method is self-explanatory and we leave it to the readers to get know the significance of each parameter. Fantastic! We have trained the model using the Random forest classifier and cloud manage to save the model too for future use. Now if you reuse the same code that we described in the Predictive analysis using the test set step, you should have the output as follows: Accuracy of the classification: 0.7843137254901961 Much better, right? If you are still not satisfied, you can try with another classifier model like Naïve Bayes classifier. Predictive analytics using the new dataset As we already mentioned that we have saved the model for future use, now we should take the opportunity to use the same model for new datasets. The reason is if you recall the steps, we have trained the model using the training set and evaluate using the test set. 
Now if you have more data or new data available to be used? Will you go for re-training the model? Of course not since you will have to iterate several steps and you will have to sacrifice valuable time and cost too. Therefore, it would be wise to use the already trained model and predict the performance on a new dataset. Well, now let's reuse the stored model then. Note that you will have to reuse the same model that is to be trained the same model. For example, if you have done the model training using the Random forest classifier and saved the model while reusing you will have to use the same classifier model to load the saved model. Therefore, we will use the Random forest to load the model while using the new dataset. Use just the following code for doing that. Now create RDD label point from the new dataset (that is, Hungarian database with same 14 attributes): String new_data = "heart_diseases/processed_hungarian.data"; RDD<String> linesRDD = spark.sparkContext().textFile(new_data, 2); JavaRDD<LabeledPoint> data = linesRDD.toJavaRDD().map(new Function<String, LabeledPoint>() { @Override public LabeledPoint call(String row) throws Exception { String line = row.replaceAll("\?", "999999.0"); String[] tokens = line.split(","); Integer last = Integer.parseInt(tokens[13]); double[] features = new double[13]; for (int i = 0; i < 13; i++) { features[i] = Double.parseDouble(tokens[i]); } Vector v = new DenseVector(features); Double value = 0.0; if (last.intValue() > 0) value = 1.0; LabeledPoint p = new LabeledPoint(value, v); return p; } }); Now let's load the saved model using the Random forest model algorithm as follows: RandomForestModel model2 = RandomForestModel.load(spark.sparkContext(), model_storage_loc); Now let's calculate the prediction on test set: JavaPairRDD<Double, Double> predictionAndLabel = data.mapToPair(new PairFunction<LabeledPoint, Double, Double>() { @Override public Tuple2<Double, Double> call(LabeledPoint p) { return new Tuple2<>(model2.predict(p.features()), p.label()); } }); Now calculate the accuracy of the prediction as follows: double accuracy = predictionAndLabel.filter(new Function<Tuple2<Double, Double>, Boolean>() { @Override public Boolean call(Tuple2<Double, Double> pl) { return pl._1().equals(pl._2()); } }).count() / (double) data.count(); System.out.println("Accuracy of the classification: "+accuracy); We got the following output: Accuracy of the classification: 0.7380952380952381 To get more interesting and fantastic machine learning application like spam filtering, topic modelling for real-time streaming data, handling graph data for machine learning, market basket analysis, neighborhood clustering analysis, Air flight delay analysis, making the ML application adaptable, Model saving and reusing, hyperparameter tuning and model selection, breast cancer diagnosis and prognosis, heart diseases prediction, optical character recognition, hypothesis testing, dimensionality reduction for high dimensional data, large-scale text manipulation and many more visits inside. Moreover, the book also contains how to scaling up the ML model to handle massive big dataset on cloud computing infrastructure. Furthermore, some best practice in the machine learning techniques has also been discussed. 
In a nutshell many useful and exciting application have been developed using the following machine learning algorithms: Linear Support Vector Machine (SVM) Linear Regression Logistic Regression Decision Tree Classifier Random Forest Classifier K-means Clustering LDA topic modelling from static and real-time streaming data Naïve Bayes classifier Multilayer Perceptron classifier for deep classification Singular Value Decomposition (SVD) for dimensionality reduction Principal Component Analysis (PCA) for dimensionality reduction Generalized Linear Regression Chi Square Test (for goodness of fit test, independence test, and feature test) KolmogorovSmirnovTest for hypothesis test Spark Core for Market Basket Analysis Multi-label classification One Vs Rest classifier Gradient Boosting classifier ALS algorithm for movie recommendation Cross-validation for model selection Train Split for model selection RegexTokenizer, StringIndexer, StopWordsRemover, HashingTF and TF-IDF for text manipulation Summary In this article we came to know that how beneficial large scale machine learning with Spark is with respect to any field. Resources for Article: Further resources on this subject: Spark for Beginners [article] Setting up Spark [article] Holistic View on Spark [article]

Diving into Data – Search and Report

Packt
17 Oct 2016
11 min read
In this article by Josh Diakun, Paul R Johnson, and Derek Mock authors of the books Splunk Operational Intelligence Cookbook - Second Edition, we will cover the basic ways to search the data in Splunk. We will cover how to make raw event data readable (For more resources related to this topic, see here.) The ability to search machine data is one of Splunk's core functions, and it should come as no surprise that many other features and functions of Splunk are heavily driven-off searches. Everything from basic reports and dashboards to data models and fully featured Splunk applications are powered by Splunk searches behind the scenes. Splunk has its own search language known as the Search Processing Language (SPL). This SPL contains hundreds of search commands, most of which also have several functions, arguments, and clauses. While a basic understanding of SPL is required in order to effectively search your data in Splunk, you are not expected to know all the commands! Even the most seasoned ninjas do not know all the commands and regularly refer to the Splunk manuals, website, or Splunk Answers (http://answers.splunk.com). To get you on your way with SPL, be sure to check out the search command cheat sheet and download the handy quick reference guide available at http://docs.splunk.com/Documentation/Splunk/latest/SearchReference/SplunkEnterpriseQuickReferenceGuide. Searching Searches in Splunk usually start with a base search, followed by a number of commands that are delimited by one or more pipe (|) characters. The result of a command or search to the left of the pipe is used as the input for the next command to the right of the pipe. Multiple pipes are often found in a Splunk search to continually refine data results as needed. As we go through this article, this concept will become very familiar to you. Splunk allows you to search for anything that might be found in your log data. For example, the most basic search in Splunk might be a search for a keyword such as error or an IP address such as 10.10.12.150. However, searching for a single word or IP over the terabytes of data that might potentially be in Splunk is not very efficient. Therefore, we can use the SPL and a number of Splunk commands to really refine our searches. The more refined and granular the search, the faster the time to run and the quicker you get to the data you are looking for! When searching in Splunk, try to filter as much as possible before the first pipe (|) character, as this will save CPU and disk I/O. Also, pick your time range wisely. Often, it helps to run the search over a small time range when testing it and then extend the range once the search provides what you need. Boolean operators There are three different types of Boolean operators available in Splunk. These are AND, OR, and NOT. Case sensitivity is important here, and these operators must be in uppercase to be recognized by Splunk. The AND operator is implied by default and is not needed, but does no harm if used. For example, searching for the term error or success would return all the events that contain either the word error or the word success. Searching for error success would return all the events that contain the words error and success. Another way to write this can be error AND success. Searching web access logs for error OR success NOT mozilla would return all the events that contain either the word error or success, but not those events that also contain the word mozilla. 
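As a quick illustration of these operators in a full search (the index, sourcetype, and host field are assumptions borrowed from the web access examples later in this article):
index=main sourcetype=access_combined (error OR success) NOT mozilla
| stats count by host
| sort -count
Note that the keyword filtering happens before the first pipe, which keeps the search efficient; the stats and sort commands used here are described in the table in the next section.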
Common commands There are many commands in Splunk that you will likely use on a daily basis when searching data within Splunk. These common commands are outlined in the following table: Command Description chart/timechart This command outputs results in a tabular and/or time-based output for use by Splunk charts. dedup This command de-duplicates results based upon specified fields, keeping the most recent match. eval This command evaluates new or existing fields and values. There are many different functions available for eval. fields This command specifies the fields to keep or remove in search results. head This command keeps the first X (as specified) rows of results. lookup This command looks up fields against an external source or list, to return additional field values. rare This command identifies the least common values of a field. rename This command renames the fields. replace This command replaces the values of fields with another value. search This command permits subsequent searching and filtering of results. sort This command sorts results in either ascending or descending order. stats This command performs statistical operations on the results. There are many different functions available for stats. table This command formats the results into a tabular output. tail This command keeps only the last X (as specified) rows of results. top This command identifies the most common values of a field. transaction This command merges events into a single event based upon a common transaction identifier. Time modifiers The drop-down time range picker in the Graphical User Interface (GUI) to the right of the Splunk search bar allows users to select from a number of different preset and custom time ranges. However, in addition to using the GUI, you can also specify time ranges directly in your search string using the earliest and latest time modifiers. When a time modifier is used in this way, it automatically overrides any time range that might be set in the GUI time range picker. The earliest and latest time modifiers can accept a number of different time units: seconds (s), minutes (m), hours (h), days (d), weeks (w), months (mon), quarters (q), and years (y). Time modifiers can also make use of the @ symbol to round down and snap to a specified time. For example, searching for sourcetype=access_combined earliest=-1d@d latest=-1h will search all the access_combined events from midnight, a day ago until an hour ago from now. Note that the snap (@) will round down such that if it were 12 p.m. now, we would be searching from midnight a day and a half ago until 11 a.m. today. Working with fields Fields in Splunk can be thought of as keywords that have one or more values. These fields are fully searchable by Splunk. At a minimum, every data source that comes into Splunk will have the source, host, index, and sourcetype fields, but some source might have hundreds of additional fields. If the raw log data contains key-value pairs or is in a structured format such as JSON or XML, then Splunk will automatically extract the fields and make them searchable. Splunk can also be told how to extract fields from the raw log data in the backend props.conf and transforms.conf configuration files. Searching for specific field values is simple. For example, sourcetype=access_combined status!=200 will search for events with a sourcetype field value of access_combined that has a status field with a value other than 200. 
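Combining field filters with time modifiers, a search along the following lines (again assuming the web access sourcetype used throughout this article) would count yesterday's non-200 responses by status code:
index=main sourcetype=access_combined status!=200 earliest=-1d@d latest=@d
| stats count by status
| sort -count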
Splunk has a number of built-in pre-trained sourcetypes that ship with Splunk Enterprise that might work with out-of-the-box, common data sources. These are available at http://docs.splunk.com/Documentation/Splunk/latest/Data/Listofpretrainedsourcetypes. In addition, Technical Add-Ons (TAs), which contain event types and field extractions for many other common data sources such as Windows events, are available from the Splunk app store at https://splunkbase.splunk.com. Saving searches Once you have written a nice search in Splunk, you may wish to save the search so that you can use it again at a later date or use it for a dashboard. Saved searches in Splunk are known as Reports. To save a search in Splunk, you simply click on the Save As button on the top right-hand side of the main search bar and select Report. Making raw event data readable When a basic search is executed in Splunk from the search bar, the search results are displayed in a raw event format by default. To many users, this raw event information is not particularly readable, and valuable information is often clouded by other less valuable data within the event. Additionally, if the events span several lines, only a few events can be seen on the screen at any one time. In this recipe, we will write a Splunk search to demonstrate how we can leverage Splunk commands to make raw event data readable, tabulating events and displaying only the fields we are interested in. Getting ready You should be familiar with the Splunk search bar and search results area. How to do it… Follow the given steps to search and tabulate the selected event data: Log in to your Splunk server. Select the Search & Reporting application from the drop-down menu located in the top left-hand side of the screen. Set the time range picker to Last 24 hours and type the following search into the Splunk search bar: index=main sourcetype=access_combined Then, click on Search or hit Enter. Splunk will return the results of the search and display the raw search events under the search bar. Let's rerun the search, but this time we will add the table command as follows: index=main sourcetype=access_combined | table _time, referer_domain, method, uri_path, status, JSESSIONID, useragent Splunk will now return the same number of events, but instead of presenting the raw events to you, the data will be in a nicely formatted table, displaying only the fields we specified. This is much easier to read! Save this search by clicking on Save As and then on Report. Give the report the name cp02_tabulated_webaccess_logs and click on Save. On the next screen, click on Continue Editing to return to the search. How it works… Let's break down the search piece by piece: Search fragment Description index=main All the data in Splunk is held in one or more indexes. While not strictly necessary, it is a good practice to specify the index (es) to search, as this will ensure a more precise search. sourcetype=access_combined This tells Splunk to search only the data associated with the access_combined sourcetype, which, in our case, is the web access logs. | table _time, referer_domain, method, uri_path, action, JSESSIONID, useragent Using the table command, we take the result of our search to the left of the pipe and tell Splunk to return the data in a tabular format. Splunk will only display the fields specified after the table command in the table of results.  In this recipe, you used the table command. The table command can have a noticeable performance impact on large searches. 
It should be used towards the end of a search, once all the other processing on the data by the other Splunk commands has been performed. The stats command is more efficient than the table command and should be used in place of table where possible. However, be aware that stats and table are two very different commands. There's more… The table command is very useful in situations where we wish to present data in a readable format. Additionally, tabulated data in Splunk can be downloaded as a CSV file, which many users find useful for offline processing in spreadsheet software or for sending to others. There are some other ways we can leverage the table command to make our raw event data readable. Tabulating every field Often, there are situations where we want to present every event within the data in a tabular format, without having to specify each field one by one. To do this, we simply use a wildcard (*) character as follows: index=main sourcetype=access_combined | table * Removing fields, then tabulating everything else While tabulating every field using the wildcard (*) character is useful, you will notice that there are a number of Splunk internal fields, such as _raw, that appear in the table. We can use the fields command before the table command to remove the fields as follows: index=main sourcetype=access_combined | fields - sourcetype, index, _raw, source date* linecount punct host time* eventtype | table * If we do not include the minus (-) character after the fields command, Splunk will keep the specified fields and remove all the other fields. Summary In this article we covered along with the introduction to Splunk, how to make raw event data readable Resources for Article: Further resources on this subject: Splunk's Input Methods and Data Feeds [Article] The Splunk Interface [Article] The Splunk Web Framework [Article]

Solving an NLP Problem with Keras, Part 2

Sasank Chilamkurthy
13 Oct 2016
6 min read
In this two-part post series, we are solving a Natural Language Processing (NLP) problem with Keras. In Part 1, we covered the problem and the ATIS dataset we are using. We also went over the word embeddings (mapping words to a vector) along with Recurrent Neural Networks that solve complicated word tagging problems. We passed the word embedding sequence as input into the RNN and we then started coding that up. Now, it is time in this post to start loading the data. Loading Data Let's load the data using data.load.atisfull(). It will download the data the first time it is run. Words and labels are encoded as indexes to a vocabulary. This vocabulary is stored in w2idx and labels2idx. import numpy as np import data.load train_set, valid_set, dicts = data.load.atisfull() w2idx, labels2idx = dicts['words2idx'], dicts['labels2idx'] train_x, _, train_label = train_set val_x, _, val_label = valid_set # Create index to word/label dicts idx2w = {w2idx[k]:k for k in w2idx} idx2la = {labels2idx[k]:k for k in labels2idx} # For conlleval script words_train = [ list(map(lambda x: idx2w[x], w)) for w in train_x] labels_train = [ list(map(lambda x: idx2la[x], y)) for y in train_label] words_val = [ list(map(lambda x: idx2w[x], w)) for w in val_x] labels_val = [ list(map(lambda x: idx2la[x], y)) for y in val_label] n_classes = len(idx2la) n_vocab = len(idx2w) Let's print an example sentence and label: print("Example sentence : {}".format(words_train[0])) print("Encoded form: {}".format(train_x[0])) print() print("It's label : {}".format(labels_train[0])) print("Encoded form: {}".format(train_label[0])) Here is the output: Example sentence : ['i', 'want', 'to', 'fly', 'from', 'boston', 'at', 'DIGITDIGITDIGIT', 'am', 'and', 'arrive', 'in', 'denver', 'at', 'DIGITDIGITDIGITDIGIT', 'in', 'the', 'morning'] Encoded form: [232 542 502 196 208 77 62 10 35 40 58 234 137 62 11 234 481 321] It's label : ['O', 'O', 'O', 'O', 'O', 'B-fromloc.city_name', 'O', 'B-depart_time.time', 'I-depart_time.time', 'O', 'O', 'O', 'B-toloc.city_name', 'O', 'B-arrive_time.time', 'O', 'O', 'B-arrive_time.period_of_day'] Encoded form: [126 126 126 126 126 48 126 35 99 126 126 126 78 126 14 126 126 12] Keras model Next, we define the Keras model. Keras has an inbuilt Embedding layer for word embeddings. It expects integer indices. SimpleRNN is the recurrent neural network layer described in Part 1. We will have to use TimeDistributed to pass the output of RNN Ot At each time step: t To a fully connected layer. Otherwise, the output at the final time step will be passed on to the next layer. from keras.models import Sequential from keras.layers.embeddings import Embedding from keras.layers.recurrent import SimpleRNN from keras.layers.core import Dense, Dropout from keras.layers.wrappers import TimeDistributed from keras.layers import Convolution1D model = Sequential() model.add(Embedding(n_vocab,100)) model.add(Dropout(0.25)) model.add(SimpleRNN(100,return_sequences=True)) model.add(TimeDistributed(Dense(n_classes, activation='softmax'))) model.compile('rmsprop', 'categorical_crossentropy') Training Now, let's start training our model. We will pass each sentence as a batch to the model. We cannot use model.fit() because it expects all of the sentences to be the same size. We will therefore use model.train_on_batch(). Training is very fast, since the dataset is relatively small. Each epoch takes 20 seconds on my Macbook Air. 
import progressbar n_epochs = 30 for i in range(n_epochs): print("Training epoch {}".format(i)) bar = progressbar.ProgressBar(max_value=len(train_x)) for n_batch, sent in bar(enumerate(train_x)): label = train_label[n_batch] # Make labels one hot label = np.eye(n_classes)[label][np.newaxis,:] # View each sentence as a batch sent = sent[np.newaxis,:] if sent.shape[1] >1: #ignore 1 word sentences model.train_on_batch(sent, label) Evaluation To measure the accuracy of the model, we use model.predict_on_batch() and metrics.accuracy.conlleval(). from metrics.accuracy import conlleval labels_pred_val = [] bar = progressbar.ProgressBar(max_value=len(val_x)) for n_batch, sent in bar(enumerate(val_x)): label = val_label[n_batch] label = np.eye(n_classes)[label][np.newaxis,:] sent = sent[np.newaxis,:] pred = model.predict_on_batch(sent) pred = np.argmax(pred,-1)[0] labels_pred_val.append(pred) labels_pred_val = [ list(map(lambda x: idx2la[x], y)) for y in labels_pred_val] con_dict = conlleval(labels_pred_val, labels_val, words_val, 'measure.txt') print('Precision = {}, Recall = {}, F1 = {}'.format( con_dict['r'], con_dict['p'], con_dict['f1'])) With this model, I get a 92.36 F1 Score. Precision = 92.07, Recall = 92.66, F1 = 92.36 Note that for the sake of brevity, I've not shown the logging part of the code. Loggging losses and accuracies are an important part of coding up an model. An improved model (described in the next section) with logging is at main.py. You can run it as : $ python main.py Improvements One drawback with our current model is that there is no look ahead, that is, output: ot This depends only on the current and previous words, but not on the words next to it. You can imagine clues about the properties of the current word that are also held by the next word. Lookahead can easily be implemented by having a convolutional layer before RNN and word embeddings: model = Sequential() model.add(Embedding(n_vocab,100)) model.add(Convolution1D(128, 5, border_mode='same', activation='relu')) model.add(Dropout(0.25)) model.add(GRU(100,return_sequences=True)) model.add(TimeDistributed(Dense(n_classes, activation='softmax'))) model.compile('rmsprop', 'categorical_crossentropy') With this improved model, I get a 94.90F1 Score! Conclusion In this two-part post series, you learned about word embeddings and RNNs. We applied these to an NLP problem: ATIS. We also made an improvement to our model. To improve the model further, you can try using word embeddings learned on a large site like Wikipedia. Also, there are variants of RNNs such as LSTM or GRU that can be experimented with. About the author Sasank Chilamkurthy works at Fractal Analytics. His work involves deep learning  on medical images obtained from radiology and pathology. He is mainly  interested in computer vision.

Spark for Beginners

Packt
13 Oct 2016
30 min read
In this article by Rajanarayanan Thottuvaikkatumana, author of the book Apache Spark 2 for Beginners, you will get an overview of Spark. By exampledata is one of the most important assets of any organization. The scale at which data is being collected and used in organizations is growing beyond imagination. The speed at which data is being ingested, the variety of the data types in use, and the amount of data that is being processed and stored are breaking all time records every moment. It is very common these days, even in small scale organizations, the data is growing from gigabytes to terabytes to petabytes. Just because of the same reason, the processing needs are also growing that asks for capability to process data at rest as well as data on the move. (For more resources related to this topic, see here.) Take any organization, its success depends on the decisions made by its leaders and for taking sound decisions, you need the backing of good data and the information generated by processing the data. This poses a big challenge on how to process the data in a timely and cost-effective manner so that right decisions can be made. Data processing techniques have evolved since the early days of computers. Countless data processing products and frameworks came into the market and disappeared over these years. Most of these data processing products and frameworks were not general purpose in nature. Most of the organizations relied on their own bespoke applications for their data processing needs in a silo way or in conjunction with specific products. Large-scale Internet applications popularly known as Internet of Things (IoT) applications heralded the common need to have open frameworks to process huge amount of data ingested at great speed dealing with various types of data. Large scale websites, media streaming applications, and huge batch processing needs of the organizations made the need even more relevant. The open source community is also growing considerably along with the growth of Internet delivering production quality software supported by reputed software companies. A huge number of companies started using open source software and started deploying them in their production environments. Apache Spark Spark is a Java Virtual Machine (JVM) based distributed data processing engine that scales, and it is fast as compared to many other data processing frameworks. Spark was born out of University of California, Berkeley, and later became one of the top projects in Apache. The research paper Mesos: A Platform for Fine-Grained Resource Sharing in the Data Center talks about the philosophy behind the design of Spark. The research paper says: "To test the hypothesis that simple specialized frameworks provide value, we identified one class of jobs that were found to perform poorly on Hadoop by machine learning researchers at our lab: iterative jobs, where a dataset is reused across a number of iterations. We built a specialized framework called Spark optimized for these workloads." The biggest claim from Spark on the speed is Run programs up to 100x faster than Hadoop MapReduce in memory, or 10x faster on disk. Spark could make this claim because Spark does the processing in the main memory of the worker nodes andprevents the unnecessary I/O operations with the disks. The other advantage Spark offers is the ability to chain the tasks even at an application programming level without writing onto the disks at all or minimizing the number of writes to the disks. 
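To give a feel for that chaining, here is a minimal PySpark sketch; it is an illustration only, the log file name is a placeholder, and it assumes a local Spark 2.x installation with the Python API available:
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("ChainingExample").master("local[*]").getOrCreate()
sc = spark.sparkContext

lines = sc.textFile("server.log")                         # transformations are lazy; nothing is read yet
errors = lines.filter(lambda line: "ERROR" in line)       # chained transformation, no intermediate disk writes
pairs = errors.map(lambda line: (line.split(" ")[0], 1))  # key each error line by its first token
counts = pairs.reduceByKey(lambda a, b: a + b)            # still lazy
print(counts.take(5))                                     # the action triggers the whole chain in memory
Until the action fires, Spark only records the lineage of transformations; when it does, the work runs in memory across the available cores rather than being written to disk between steps.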
The Spark programming paradigm is very powerful and exposes a uniform programming model supporting the application development in multiple programming languages. Spark supports programming in Scala, Java, Python, and R even though there is no functional parity across all the programming languages supported. Apart from writing Spark applications in these programming languages, Spark has an interactive shell with Read, Evaluate, Print, and Loop (REPL) capabilities for the programming languages Scala, Python, and R. At this moment, there is no REPL support for Java in Spark. The Spark REPL is a very versatile tool that can be used to try and test Spark application code in an interactive fashion. The Spark REPL enables easy prototyping, debugging, and much more. In addition to the core data processing engine, Spark comes with a powerful stack of domain-specific libraries that use the core Spark libraries and provide various functionalities useful for various big data processing needs. The following list gives the supported libraries: Library Use Supported Languages Spark SQL Enables the use of SQL statements or DataFrame API inside Spark applications Scala, Java, Python, and R Spark Streaming Enables processing of live data streams Scala, Java, and Python Spark MLlib Enables development of machine learning applications Scala, Java, Python, and R Spark GraphX Enables graph processing and supports a growing library of graph algorithms Scala Understanding the Spark programming model Spark became an instant hit in the market because of its ability to process a huge amount of data types and growing number of data sources and data destinations. The most important and basic data abstraction Spark provides is the resilient distributed dataset (RDD). Spark supports distributed processing on a cluster of nodes. The moment there is a cluster of nodes, there are good chances that when the data processing is going on, some of the nodes can die. When such failures happen, the framework should be capable of coming out of such failures. Spark is designed to do that and that is what the resilient part in the RDD signifies. If there is a huge amount of data to be processed and there are nodes available in the cluster, the framework should have the capability to split the big dataset into smaller chunks and distribute them to be processed on more than one node in a cluster in parallel. Spark is capable of doing that and that is what the distributed part in the RDD signifies. In other words, Spark is designed from ground up to have its basic dataset abstraction capable of getting split into smaller pieces deterministically and distributed to more than one nodes in the cluster for parallel processing while elegantly handling the failures in the nodes. Spark RDD is immutable. Once an RDD is created, intentionally or unintentionally, it cannot be changed. This gives another insight into the construction of an RDD. There are some strong rules based on which an RDD is created. Because of that, when the nodes processing some part of an RDD die, the driver program can recreate those parts and assign the task of processing it to another node and ultimately completing the data processing job successfully. Since the RDD is immutable, splitting a big one to smaller ones, distributing them to various worker nodes for processing and finally compiling the results to produce the final result can be done safely without worrying about the underlying data getting changed. Spark RDD is distributable. 
If Spark is run in cluster mode, where there are multiple worker nodes available to take the tasks, all these nodes have different execution contexts. The individual tasks are distributed and run on different JVMs. All these activities of a big RDD getting divided into smaller chunks, getting distributed for processing to the worker nodes, and finally assembling the results back are completely hidden from the users. Spark has its own mechanism for recovering from system faults and other forms of errors happening during the data processing. Hence, this data abstraction is highly resilient.

Spark RDD lives in memory (most of the time). Spark keeps all the RDDs in memory as much as it can. Only in rare situations, where Spark is running out of memory or the data size is growing beyond its capacity, is the data written to disk. Most of the processing on an RDD happens in memory, and that is the reason why Spark is able to process data at lightning speed.

Spark RDD is strongly typed. A Spark RDD can be created using any supported data type. These data types can be Scala/Java-supported intrinsic data types or custom-created data types such as your own classes. The biggest advantage coming out of this design decision is the freedom from runtime errors. If it is going to break because of a data type issue, it will break at compile time.

Spark does the data processing using RDDs. Data is read from the relevant data sources, such as text files and NoSQL data stores, to form the RDDs. On such an RDD, various data transformations are performed, and finally the result is collected. To be precise, Spark comes with Spark Transformations and Spark Actions that act upon RDDs. Whenever a Spark Transformation is done on an RDD, a new RDD gets created. This is because RDDs are inherently immutable. The RDDs that are created at the end of each Spark Transformation can be saved for future reference, or they will go out of scope eventually. The Spark Actions are used to return the computed values to the driver program. The process of creating one or more RDDs and applying transformations and actions on them is a very common usage pattern seen ubiquitously in Spark applications.

Spark SQL

Spark SQL is a library built on top of Spark. It exposes an SQL interface and a DataFrame API. The DataFrame API supports the programming languages Scala, Java, Python, and R. In programming languages such as R, there is a data frame abstraction used to store data tables in memory. The Python data analysis library named Pandas also has a similar data frame concept. Once that data structure is available in memory, programs can extract the data and slice and dice it as per their needs. The same data table concept is extended to Spark as the DataFrame, built on top of the RDD, and there is a very comprehensive API known as the DataFrame API in Spark SQL to process the data in a DataFrame. An SQL-like query language is also developed on top of the DataFrame abstraction, catering to the needs of end users to query and process the underlying structured data. In summary, a DataFrame is a distributed data table organized in rows and columns, with names for each column. There is no doubt that SQL is the lingua franca for doing data analysis, and Spark SQL is the answer from the Spark family of toolsets for data analysis. So what does it provide? It provides the ability to run SQL on top of Spark.
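As a hedged illustration (the JSON file and column names below are invented, not taken from the book), the following PySpark snippet registers a DataFrame as a temporary view and queries it both with SQL and with the equivalent DataFrame API call:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("SparkSQLExample").getOrCreate()

# Read structured data into a DataFrame; the file and its schema are hypothetical.
transactions = spark.read.json("hdfs:///data/transactions.json")

# Register the DataFrame as a temporary view and run plain SQL over it.
transactions.createOrReplaceTempView("transactions")
spark.sql("SELECT accountNo, SUM(amount) AS total "
          "FROM transactions GROUP BY accountNo").show()

# The same aggregation expressed through the DataFrame API (DSL) instead of SQL.
transactions.groupBy("accountNo").sum("amount").show()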
Whether the data comes from CSV, Avro, Parquet, Hive, NoSQL data stores such as Cassandra, or even an RDBMS, Spark SQL can be used to analyze it and mix SQL with Spark programs. Many of the data sources mentioned here are supported intrinsically by Spark SQL, and many others are supported by external packages. The most important aspect to highlight here is the ability of Spark SQL to deal with data from a very wide variety of data sources. Once the data is available as a DataFrame in Spark, Spark SQL can process it in a completely distributed way, combining the DataFrames coming from various data sources and querying them as if the entire dataset came from a single source.

In the previous section, the RDD was discussed and introduced as the Spark programming model. Are the DataFrame API and the usage of SQL dialects in Spark SQL replacing the RDD-based programming model? Definitely not! The RDD-based programming model is the generic and basic data processing model in Spark. RDD-based programming requires the use of real programming techniques. The Spark Transformations and Spark Actions use a lot of functional programming constructs. Even though the amount of code that needs to be written in the RDD-based programming model is less as compared to Hadoop MapReduce or any other paradigm, there is still a need to write some amount of functional code. This is a barrier to entry for many data scientists, data analysts, and business analysts who mainly perform exploratory data analysis or do some prototyping with the data. Spark SQL completely removes those constraints. Simple and easy-to-use domain-specific language (DSL) based methods to read and write data from data sources, an SQL-like language to select, filter, and aggregate, and the capability to read data from a wide variety of data sources make it easy for anybody who knows the data structure to use it.

Which is the best use case for an RDD, and which is the best use case for Spark SQL? The answer is very simple. If the data is structured, it can be arranged in tables, and each column can be given a name, then use Spark SQL. This doesn't mean that the RDD and the DataFrame are two disparate entities. They interoperate very well. Conversions from RDD to DataFrame and vice versa are very much possible. Many of the Spark Transformations and Spark Actions that are typically applied on RDDs can also be applied on DataFrames. Interaction with the Spark SQL library is done mainly through two methods: one is through SQL-like queries, and the other is through the DataFrame API.

The Spark programming paradigm has many abstractions to choose from when it comes to developing data processing applications. The fundamentals of Spark programming start with RDDs, which can easily deal with unstructured, semi-structured, and structured data. The Spark SQL library offers highly optimized performance when processing structured data. This makes the basic RDDs look inferior in terms of performance. To fill this gap, from Spark 1.6 onwards, a new abstraction named Dataset was introduced that complements the RDD-based Spark programming model. It works pretty much the same way as the RDD when it comes to Spark Transformations and Spark Actions, and at the same time it is highly optimized like Spark SQL. The Dataset API provides strong compile-time type safety when it comes to writing programs, and because of that, the Dataset API is available only in Scala and Java. Too many choices can confuse anybody.
The same problem is seen here in the Spark programming model, but it is not as confusing as in many other programming paradigms. Whenever there is a need to process any kind of data with very high flexibility in terms of the data processing requirements and the lowest level of API control, such as library development, the RDD-based programming model is ideal. Whenever there is a need to process structured data with flexibility for accessing and processing data, with optimized performance across all the supported programming languages, the DataFrame-based Spark SQL programming model is ideal. Whenever there is a need to process unstructured data with optimized performance and compile-time type safety, but without very complex Spark Transformations and Spark Actions usage requirements, the Dataset-based programming model is ideal. At a data processing application development level, if the programming language of choice permits, it is better to use Dataset and DataFrame to get better performance.

R on Spark

A base R installation cannot interact with Spark. The SparkR package, popularly known as R on Spark, exposes all the required objects and functions for R to talk to the Spark ecosystem. As compared to Scala, Java, and Python, Spark programming in R is different, and the SparkR package mainly exposes an R API for DataFrame-based Spark SQL programming. At this moment, R cannot be used to manipulate the RDDs of Spark directly. So, for all practical purposes, the R API for Spark has access to only the Spark SQL abstractions.

How is SparkR going to help data scientists do better data processing? A base R installation mandates that all the data be stored (or be accessible) on the computer where R is installed. The data processing happens on the single computer on which the R installation is available. Moreover, if the data size is larger than the main memory available on the computer, R will not be able to do the required processing. With the SparkR package, there is access to a whole new world of a cluster of nodes for data storage and for carrying out data processing. With the help of the SparkR package, R can access Spark DataFrames as well as R DataFrames.

It is very important to understand the distinction between the two types of data frames. An R DataFrame is completely local and a data structure of the R language. A Spark DataFrame is a parallel collection of structured data managed by the Spark infrastructure. An R DataFrame can be converted to a Spark DataFrame, and a Spark DataFrame can be converted to an R DataFrame. When a Spark DataFrame is converted to an R DataFrame, it should fit in the available memory of the computer. This conversion is a great feature. By converting an R DataFrame to a Spark DataFrame, the data can be distributed and processed in parallel. By converting a Spark DataFrame to an R DataFrame, the many computations, charting, and plotting done by other R functions can be performed. In a nutshell, the SparkR package brings the power of distributed and parallel computing to R.

Many times, when doing data processing with R, because of the sheer size of the data and the need to fit it into the main memory of the computer, the data processing is done in multiple batches and the results are consolidated to compute the final results. This kind of multibatch processing can be completely avoided if Spark with R is used to process the data. Many times, reporting, charting, and plotting are done on aggregated and summarized raw data.
The raw data size can be huge and need not fit on one computer. In such cases, Spark with R can be used to process the entire raw data, and finally the aggregated and summarized data can be used to produce the reports, charts, or plots. Because R alone cannot process huge amounts of data while carrying out data analysis, ETL tools are often used to do the preprocessing or transformations on the raw data, and only in the final stage is the data analysis done using R. Because of Spark's ability to process data at scale, Spark with R can replace the entire ETL pipeline and do the desired data analysis with R. The SparkR package is yet another R package, but that does not stop anybody from using any of the R packages that are already in use. At the same time, it supplements the data processing capability of R manifold by making use of the huge data processing capabilities of Spark.

Spark data analysis with Python

The ultimate goal of processing data is to use the results for answering business questions. It is very important to understand the data that is being used to answer the business questions. To understand the data better, various tabulation methods and charting and plotting techniques are used. Visual representation of the data reinforces the understanding of the underlying data. Because of this, data visualization is used extensively in data analysis.

Different terms are used in various publications to mean the analysis of data for answering business questions. Data analysis, data analytics, business intelligence, and so on are some of the ubiquitous terms floating around. This section is not going to delve into a discussion of the meaning, similarities, or differences of these terms. Instead, the focus is going to be on how to bridge the gap between two major activities typically done by data scientists or data analysts: the first being data processing, and the second being the use of the processed data to do analysis with the help of charting and plotting. Data analysis is the forte of data analysts and data scientists. This book focuses on the usage of Spark and Python to process the data and produce charts and plots.

In many data analysis use cases, a superset of data is processed and the reduced resultant dataset is used for the data analysis. This is specifically valid in the case of big data analysis, where a small set of processed data is used for analysis. Depending on the use case, appropriate data processing has to be done as a prerequisite for the various data analysis needs. Most of the use cases covered in this book fall into this model, where the first step deals with the necessary data processing and the second step deals with the charting and plotting required for the data analysis.

In typical data analysis use cases, the chain of activities involves an extensive, multi-staged Extract-Transform-Load (ETL) pipeline ending with a data analysis platform or application. The end result of this chain of activities includes, but is not limited to, tables of summary data and various visual representations of the data in the form of charts and plots. Since Spark can process data from heterogeneous distributed data sources very effectively, the huge ETL pipeline that existed in legacy data analysis applications can be consolidated into self-contained applications that do the data processing and data analysis.
Process data using Spark, analyze using Python

Python is a programming language heavily used by data analysts and data scientists these days. There are numerous scientific and statistical data processing libraries, as well as charting and plotting libraries, available that can be used in Python programs. It is also a widely used programming language for developing data processing applications in Spark. This brings in great flexibility to have a unified data processing and data analysis framework with Spark, Python, and the Python libraries, enabling scientific and statistical processing as well as charting and plotting. There are numerous such libraries that work with Python. Out of all of them, the NumPy and SciPy libraries are used here to do numerical, statistical, and scientific data processing. The matplotlib library is used here to carry out charting and plotting that produces 2D images. Processed data is used for data analysis, and that requires a deep understanding of the processed data. Charts and plots enhance the understanding of the characteristics of the underlying data. In essence, for a data analysis application, data processing, charting, and plotting are essential. This book covers the usage of Spark with Python in conjunction with Python charting and plotting libraries for developing data analysis applications.

Spark Streaming

Data processing use cases can be mainly divided into two types. The first type is the use cases where the data is static and the processing is done in its entirety as one unit of work, or by dividing it into smaller batches. While doing the data processing, neither does the underlying dataset change nor do new datasets get added to the processing units. This is batch processing. The second type is the use cases where the data is generated like a stream, and the processing is done as and when the data is generated. This is stream processing. Data sources generate data like a stream, and many real-world use cases require them to be processed in a real-time fashion. The meaning of real-time can change from use case to use case. The main parameter that defines what is meant by real-time for a given use case is how soon the ingested data needs to be processed, or the regular interval at which all the data ingested since the last interval needs to be processed. For example, when a major sports event is happening, the application that consumes the score events and sends them to the subscribed users should be processing the data as fast as it can. The faster they can be sent, the better it is. But what is the definition of fast here? Is it fine to process the score data, say, an hour after the score event happened? Probably not. Is it fine to process the data, say, a minute after the score event happened? It is definitely better than processing it after an hour. Is it fine to process the data, say, a second after the score event happened? Probably yes, and much better than the earlier data processing time intervals.

In any data stream processing use case, this time interval is very important. The data processing framework should have the capability to process the data stream at an appropriate time interval of choice to deliver good business value. When processing stream data in regular intervals of choice, the data is collected from the beginning of the time interval to the end of the time interval, grouped into a micro batch, and data processing is done on that batch of data.
Over an extended period of time, the data processing application will have processed many such micro batches of data. In this type of processing, the data processing application has visibility of only the specific micro batch that is getting processed at a given point in time. In other words, the application will not have any visibility of, or access to, the already processed micro batches of data.

Now, there is another dimension to this type of processing. Suppose a given use case mandates the need to process the data every minute, but at the same time, while processing the data of a given micro batch, there is a need to peek into the data that was already processed in the last 15 minutes. A fraud detection module of a retail banking transaction processing application is a good example of this particular business requirement. There is no doubt that retail banking transactions are to be processed within milliseconds of their occurrence. When processing an ATM cash withdrawal transaction, it is a good idea to see whether somebody is trying to continuously withdraw cash in quick succession and, if so, to send a proper alert. For this, when processing a given cash withdrawal transaction, check whether there were any other cash withdrawals from the same ATM using the same card in the last 15 minutes. The business rule is to alert when there are more than two such transactions in the last 15 minutes. In this use case, the fraud detection application should have visibility of all the transactions that happened within a window of 15 minutes. A good stream data processing framework should have the capability to process the data in any given interval of time, with the ability to peek into the data ingested within a sliding window of time. The Spark Streaming library, which works on top of Spark, is one of the best data stream processing frameworks, and it has both of these capabilities.
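The following sketch against the Spark Streaming DStream API is only an illustration of the ATM scenario described above (the socket source, record format, and threshold are assumptions, not material from the book); it processes a micro batch every minute while counting withdrawals per card over a sliding 15-minute window:

from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext(appName="WindowedFraudCheck")
ssc = StreamingContext(sc, 60)               # micro batch interval of 60 seconds
ssc.checkpoint("hdfs:///tmp/checkpoint")     # checkpointing for windowed state

# Hypothetical feed of "cardNo,atmId,amount" records arriving on a socket.
events = ssc.socketTextStream("localhost", 9999)
withdrawals = events.map(lambda line: (line.split(",")[0], 1))

# Count withdrawals per card over the last 15 minutes (900s), sliding every minute.
counts = withdrawals.reduceByKeyAndWindow(lambda a, b: a + b, None, 900, 60)
counts.filter(lambda pair: pair[1] > 2).pprint()   # more than two: candidate for an alert

ssc.start()
ssc.awaitTermination()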
Spark machine learning

Calculations based on formulae or algorithms have been very common since ancient times to find the output for a given input. But without knowing the formulae or algorithms, computer scientists and mathematicians devised methods to generate formulae or algorithms based on an existing set of input/output data, and to predict the output of new input data based on the generated formulae or algorithms. Generally, this process of 'learning' from a dataset and doing predictions based on the 'learning' is known as machine learning. It has its origin in the study of artificial intelligence in computer science.

Practical machine learning has numerous applications that are consumed by laypeople on a daily basis. YouTube users get suggestions for the next items to be played in the playlist based on the video they are currently viewing. Popular movie rating sites give ratings and recommendations based on user preferences. Social media websites, such as Facebook, suggest a list of names of the users' friends for easy tagging of pictures. What Facebook is doing here is classifying the pictures by the names that are already available in the albums and checking whether the newly added picture has any similarity with the existing ones. If it finds a similarity, it suggests the name. The applications of this kind of picture identification are many. The way all these applications work is based on the huge amount of input/output data that has already been collected and the learning done on that dataset. When a new input dataset comes, a prediction is made by making use of the 'learning' that the computer or machine has already done.

In traditional computing, input data is fed to a program to generate output. But in machine learning, input data and output data are fed to a machine learning algorithm to generate a function or program that can be used to predict the output for an input, according to the 'learning' done on the input/output dataset fed to the machine learning algorithm. The data available in the wild may be classified into groups, may form clusters, or may fit into certain relationships. These are different kinds of machine learning problems. For example, if there is a databank of preowned car sale prices with their associated attributes or features, it is possible to predict the fair price of a car just by knowing the associated attributes or features. Regression algorithms are used to solve these kinds of problems. If there is a databank of spam and non-spam e-mails, then when a new mail comes, it is possible to predict whether the new mail is spam or non-spam. Classification algorithms are used to solve these kinds of problems.

These are just a few machine learning algorithm types. But in general, when using a bank of data, if there is a need to apply a machine learning algorithm and make predictions using the resulting model, then the data should be divided into features and outputs. So, whichever machine learning algorithm is being used, there will be a set of features and one or more outputs. Many books and publications use the term label for output. In other words, features are the input and the label is the output. Data comes in various shapes and forms. Depending on the machine learning algorithm used, the training data has to be preprocessed to have the features and labels in the right format to be fed to the machine learning algorithm. That, in turn, generates the appropriate hypothesis function, which takes the features as input and produces the predicted label.

Why Spark for machine learning?

The Spark machine learning library uses many Spark core functionalities as well as Spark libraries such as Spark SQL. The Spark machine learning library makes machine learning application development easy by combining data processing and machine learning algorithm implementations in a unified framework, with the ability to do data processing on a cluster of nodes and the ability to read and write data in a variety of data formats. Spark comes with two flavors of the machine learning library: spark.mllib and spark.ml. The first one is developed on top of Spark's RDD abstraction, and the second one is developed on top of Spark's DataFrame abstraction. It is recommended to use the spark.ml library for any future machine learning application development.
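As a small, assumption-laden sketch of the spark.ml (DataFrame-based) API, the snippet below assembles two invented feature columns into a vector and fits a logistic regression model through a pipeline; the CSV file and column names are hypothetical:

from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import LogisticRegression

spark = SparkSession.builder.appName("SparkMLExample").getOrCreate()

# Hypothetical training data with two numeric feature columns and a 0/1 label column.
training = spark.read.csv("hdfs:///data/emails.csv", header=True, inferSchema=True)

# Assemble the feature columns into the single vector column expected by spark.ml.
assembler = VectorAssembler(inputCols=["wordCount", "linkCount"], outputCol="features")
lr = LogisticRegression(featuresCol="features", labelCol="label")
model = Pipeline(stages=[assembler, lr]).fit(training)   # learn from the features and labels

# Apply the learned hypothesis function to produce predicted labels.
model.transform(training).select("label", "prediction").show()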
Spark graph processing

A graph is a mathematical concept and a data structure in computer science. It has huge applications in many real-world use cases. It is used to model pair-wise relationships between entities. The entity here is known as a Vertex, and two vertices are connected by an Edge. A graph comprises a collection of vertices and the edges connecting them. Conceptually, it is a deceptively simple abstraction, but when it comes to processing a huge number of vertices and edges, it is computationally intensive and consumes lots of processing time and computing resources. There are numerous application constructs that can be modeled as a graph. In a social networking application, the relationship between users can be modeled as a graph, in which the users form the vertices and the relationships between users form the edges. In a multistage job scheduling application, the individual tasks form the vertices of the graph and the sequencing of the tasks forms the edges. In a road traffic modeling system, the towns form the vertices of the graph and the roads connecting the towns form the edges.

The edges of a given graph have a very important property, namely, the direction of the connection. In many use cases, the direction of the connection doesn't matter; connectivity between cities by road is one such example. But if the use case is to produce driving directions within a city, the connectivity between traffic junctions has a direction. Take any two traffic junctions and there will be road connectivity, but it is possible that it is a one-way. So, it depends on the direction in which the traffic is flowing. If the road is open for traffic from traffic junction J1 to J2 but closed from J2 to J1, then the graph of driving directions will have connectivity from J1 to J2 and not from J2 to J1. In such cases, the edge connecting J1 and J2 has a direction. If the traffic between J2 and J3 is open both ways, then the edge connecting J2 and J3 has no direction. A graph with all its edges having a direction is called a directed graph.

For graph processing, many libraries are available in the open source world itself. Giraph, Pregel, GraphLab, and Spark GraphX are some of them. Spark GraphX is one of the recent entrants into this space. What is so special about Spark GraphX? It is a graph processing library built on top of the Spark data processing framework. Compared to the other graph processing libraries, Spark GraphX has a real advantage: it can make use of all the data processing capabilities of Spark. In reality, the performance of graph processing algorithms is not the only aspect that needs consideration. In many applications, the data that needs to be modeled as a graph does not exist in that form naturally. In many use cases, more than on the graph processing itself, a lot of processor time and other computing resources are expended on getting the data into the right format so that the graph processing algorithms can be applied. This is the sweet spot where the combination of the Spark data processing framework and the Spark GraphX library delivers its greatest value. The data processing jobs that make the data ready to be consumed by Spark GraphX can be easily done using the plethora of tools available in the Spark toolkit. In summary, the Spark GraphX library, which is part of the Spark family, combines the power of the core data processing capabilities of Spark with a very easy-to-use graph processing library.

The biggest limitation of the Spark GraphX library is that its API is not currently supported in programming languages such as Python and R. But there is an external Spark package named GraphFrames that solves this limitation. Since GraphFrames is a DataFrame-based library, once it matures, it will enable graph processing in all the programming languages supported by DataFrames. This Spark external package is definitely a potential candidate to be included as part of Spark itself.
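As a tentative sketch of what GraphFrames looks like from Python (GraphFrames is an external package that has to be added to the job, for example via the --packages option; the vertex and edge data here are made up):

from pyspark.sql import SparkSession
from graphframes import GraphFrame

spark = SparkSession.builder.appName("GraphFramesExample").getOrCreate()

# Vertices and edges are plain DataFrames; "id", "src", and "dst" are the expected column names.
users = spark.createDataFrame(
    [("u1", "Alice"), ("u2", "Bob"), ("u3", "Carol")], ["id", "name"])
follows = spark.createDataFrame(
    [("u1", "u2"), ("u2", "u3"), ("u3", "u1")], ["src", "dst"])

graph = GraphFrame(users, follows)
graph.inDegrees.show()                                            # simple structural query
graph.pageRank(resetProbability=0.15, maxIter=5).vertices.show()  # a built-in graph algorithm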
Summary

Any technology learned or taught has to be concluded with an application developed covering its salient features. Spark is no different. This book concludes with an end-to-end application developed using the Lambda Architecture, with Spark as the data processing platform and its family of libraries built on top of it.

Resources for Article: Further resources on this subject: Setting up Spark [article] Machine Learning Using Spark MLlib [article] Holistic View on Spark [article]

Reconstructing 3D Scenes

Packt
13 Oct 2016
25 min read
In this article by Robert Laganiere, the author of the book OpenCV 3 Computer Vision Application Programming Cookbook Third Edition, the following recipes are covered: Calibrating a camera, and Recovering camera pose. (For more resources related to this topic, see here.)

Digital image formation

Let us now redraw a new version of the figure describing the pin-hole camera model. More specifically, we want to demonstrate the relation between a point in 3D at position (X,Y,Z) and its image (x,y) on a camera, specified in pixel coordinates. Note the changes that have been made to the original figure. First, a reference frame was positioned at the center of the projection; then, the Y-axis was aligned to point downwards to get a coordinate system that is compatible with the usual convention, which places the image origin at the upper-left corner of the image. Finally, we have identified a special point on the image plane, by considering the line coming from the focal point that is orthogonal to the image plane. The point (u0,v0) is the pixel position at which this line pierces the image plane and is called the principal point. It would be logical to assume that this principal point is at the center of the image plane, but in practice, it might be off by a few pixels depending on the precision of the camera. Since we are dealing with digital images, the number of pixels on the image plane (its resolution) is another important characteristic of a camera.

We learned previously that a 3D point (X,Y,Z) will be projected onto the image plane at (fX/Z,fY/Z). Now, if we want to translate this coordinate into pixels, we need to divide the 2D image position by the pixel width (px) and then the height (py). Note that by dividing the focal length given in world units (generally given in millimeters) by px, we obtain the focal length expressed in (horizontal) pixels. We will then define this term as fx. Similarly, fy = f/py is defined as the focal length expressed in vertical pixel units. The complete projective equation is therefore as shown: x = fx X/Z + u0 and y = fy Y/Z + v0. We know that (u0,v0) is the principal point that is added to the result in order to move the origin to the upper-left corner of the image. Also, the physical size of a pixel can be obtained by dividing the size of the image sensor (generally in millimeters) by the number of pixels (horizontally or vertically). In modern sensors, pixels are generally of square shape, that is, they have the same horizontal and vertical size. The preceding equations can be rewritten in matrix form. Here is the complete projective equation in its most general form: s[x, y, 1]T = [[fx, 0, u0], [0, fy, v0], [0, 0, 1]] [R|t] [X, Y, Z, 1]T, where [R|t] is the 3x4 matrix made of the 3x3 rotation matrix R and the translation vector t, and s is an arbitrary scale factor.

Calibrating a camera

Camera calibration is the process by which the different camera parameters (that is, the ones appearing in the projective equation) are obtained. One can obviously use the specifications provided by the camera manufacturer, but for some tasks, such as 3D reconstruction, these specifications are not accurate enough. However, accurate calibration information can be obtained by undertaking an appropriate camera calibration step. An active camera calibration procedure will proceed by showing known patterns to the camera and analyzing the obtained images. An optimization process will then determine the optimal parameter values that explain the observations. This is a complex process that has been made easy by the availability of OpenCV calibration functions.

How to do it...

To calibrate a camera, the idea is to show it a set of scene points for which the 3D positions are known.
Then, you need to observe where these points project on the image. With the knowledge of a sufficient number of 3D points and associated 2D image points, the exact camera parameters can be inferred from the projective equation. Obviously, for accurate results, we need to observe as many points as possible. One way to achieve this would be to take a picture of a scene with known 3D points, but in practice, this is rarely feasible. A more convenient way is to take several images of a set of 3D points from different viewpoints. This approach is simpler, but it requires you to compute the position of each camera view in addition to the computation of the internal camera parameters, which is fortunately feasible.

OpenCV proposes that you use a chessboard pattern to generate the set of 3D scene points required for calibration. This pattern creates points at the corners of each square, and since this pattern is flat, we can freely assume that the board is located at Z=0, with the X and Y axes well-aligned with the grid. In this case, the calibration process simply consists of showing the chessboard pattern to the camera from different viewpoints. The following is an example of a calibration pattern image made of 7x5 inner corners as captured during the calibration step:

The good thing is that OpenCV has a function that automatically detects the corners of this chessboard pattern. You simply provide an image and the size of the chessboard used (the number of horizontal and vertical inner corner points). The function will return the position of these chessboard corners on the image. If the function fails to find the pattern, then it simply returns false, as shown:

//output vectors of image points
std::vector<cv::Point2f> imageCorners;
//number of inner corners on the chessboard
cv::Size boardSize(7,5);
//Get the chessboard corners
bool found = cv::findChessboardCorners(image, // image of chessboard pattern
    boardSize, // size of pattern
    imageCorners); // list of detected corners

The output parameter, imageCorners, will simply contain the pixel coordinates of the detected inner corners of the shown pattern. Note that this function accepts additional parameters if you need to tune the algorithm, which are not discussed here. There is also a special function that draws the detected corners on the chessboard image, with lines connecting them in a sequence:

//Draw the corners
cv::drawChessboardCorners(image, boardSize, imageCorners, found); // corners have been found

The following image is obtained: The lines that connect the points show the order in which the points are listed in the vector of detected image points. To perform a calibration, we now need to specify the corresponding 3D points. You can specify these points in the units of your choice (for example, in centimeters or in inches); however, the simplest is to assume that each square represents one unit. In that case, the coordinates of the first point would be (0,0,0) (assuming that the board is located at a depth of Z=0), the coordinates of the second point would be (1,0,0), and so on, the last point being located at (6,4,0). There are a total of 35 points in this pattern, which is too few to obtain an accurate calibration. To get more points, you need to show more images of the same calibration pattern from various points of view. To do so, you can either move the pattern in front of the camera or move the camera around the board; from a mathematical point of view, this is completely equivalent.
The OpenCV calibration function assumes that the reference frame is fixed on the calibration pattern and will calculate the rotation and translation of the camera with respect to the reference frame. Let's now encapsulate the calibration process in a CameraCalibrator class. The attributes of this class are as follows:

// input points:
// the points in world coordinates
// (each square is one unit)
std::vector<std::vector<cv::Point3f>> objectPoints;
// the image point positions in pixels
std::vector<std::vector<cv::Point2f>> imagePoints;
// output Matrices
cv::Mat cameraMatrix;
cv::Mat distCoeffs;
// flag to specify how calibration is done
int flag;

Note that the input vectors of the scene and image points are in fact made of std::vector of point instances; each vector element is a vector of the points from one view. Here, we decided to add the calibration points by specifying a vector of the chessboard image filenames as input; the method will take care of extracting the point coordinates from the images:

// Open chessboard images and extract corner points
int CameraCalibrator::addChessboardPoints(const std::vector<std::string>& filelist, // list of filenames
    cv::Size & boardSize) { // calibration board size
  // the points on the chessboard
  std::vector<cv::Point2f> imageCorners;
  std::vector<cv::Point3f> objectCorners;
  // 3D Scene Points:
  // Initialize the chessboard corners
  // in the chessboard reference frame
  // The corners are at 3D location (X,Y,Z)= (i,j,0)
  for (int i=0; i<boardSize.height; i++) {
    for (int j=0; j<boardSize.width; j++) {
      objectCorners.push_back(cv::Point3f(i, j, 0.0f));
    }
  }
  // 2D Image points:
  cv::Mat image; // to contain chessboard image
  int successes = 0;
  // for all viewpoints
  for (int i=0; i<filelist.size(); i++) {
    // Open the image
    image = cv::imread(filelist[i],0);
    // Get the chessboard corners
    bool found = cv::findChessboardCorners(image, // image of chessboard pattern
        boardSize, // size of pattern
        imageCorners); // list of detected corners
    // Get subpixel accuracy on the corners
    if (found) {
      cv::cornerSubPix(image, imageCorners,
          cv::Size(5, 5), // half size of search window
          cv::Size(-1, -1),
          cv::TermCriteria(cv::TermCriteria::MAX_ITER + cv::TermCriteria::EPS, 30, // max number of iterations
          0.1)); // min accuracy
      // If we have a good board, add it to our data
      if (imageCorners.size() == boardSize.area()) {
        // Add image and scene points from one view
        addPoints(imageCorners, objectCorners);
        successes++;
      }
    }
  }
  return successes;
}

The first loop inputs the 3D coordinates of the chessboard, and the corresponding image points are the ones provided by the cv::findChessboardCorners function; this is done for all the available viewpoints. Moreover, in order to obtain a more accurate image point location, the cv::cornerSubPix function can be used, and as the name suggests, the image points will then be localized at subpixel accuracy. The termination criterion that is specified by the cv::TermCriteria object defines the maximum number of iterations and the minimum accuracy in subpixel coordinates. The first of these two conditions that is reached will stop the corner refinement process. When a set of chessboard corners has been successfully detected, these points are added to the vectors of the image and scene points using our addPoints method.
Once a sufficient number of chessboard images have been processed (and consequently, a large number of 3D scene point / 2D image point correspondences are available), we can initiate the computation of the calibration parameters as shown:

// Calibrate the camera
// returns the re-projection error
double CameraCalibrator::calibrate(cv::Size &imageSize){
  //Output rotations and translations
  std::vector<cv::Mat> rvecs, tvecs;
  // start calibration
  return calibrateCamera(objectPoints, // the 3D points
      imagePoints, // the image points
      imageSize, // image size
      cameraMatrix, // output camera matrix
      distCoeffs, // output distortion matrix
      rvecs, tvecs, // Rs, Ts
      flag); // set options
}

In practice, 10 to 20 chessboard images are sufficient, but these must be taken from different viewpoints at different depths. The two important outputs of this function are the camera matrix and the distortion parameters. These will be described in the next section.

How it works...

In order to explain the result of the calibration, we need to go back to the projective equation presented in the introduction of this article. This equation describes the transformation of a 3D point into a 2D point through the successive application of two matrices. The first matrix includes all of the camera parameters, which are called the intrinsic parameters of the camera. This 3x3 matrix is one of the output matrices returned by the cv::calibrateCamera function. There is also a function called cv::calibrationMatrixValues that explicitly returns the value of the intrinsic parameters given by a calibration matrix.

The second matrix is there to have the input points expressed in camera-centric coordinates. It is composed of a rotation component (a 3x3 matrix) and a translation vector (a 3x1 matrix). Remember that in our calibration example, the reference frame was placed on the chessboard. Therefore, there is a rigid transformation (made of a rotation component represented by the matrix entries r1 to r9 and a translation represented by t1, t2, and t3) that must be computed for each view. These are in the output parameter list of the cv::calibrateCamera function. The rotation and translation components are often called the extrinsic parameters of the calibration, and they are different for each view. The intrinsic parameters remain constant for a given camera/lens system.

The calibration results provided by cv::calibrateCamera are obtained through an optimization process. This process aims to find the intrinsic and extrinsic parameters that minimize the difference between the predicted image point position, as computed from the projection of the 3D scene points, and the actual image point position, as observed on the image. The sum of this difference for all the points specified during the calibration is called the re-projection error. The intrinsic parameters of our test camera obtained from a calibration based on the 27 chessboard images are fx=409 pixels; fy=408 pixels; u0=237; and v0=171. Our calibration images have a size of 536x356 pixels. From the calibration results, you can see that, as expected, the principal point is close to the center of the image, but yet off by a few pixels. The calibration images were taken using a Nikon D500 camera with an 18mm lens. Looking at the manufacturer specifications, we find that the sensor size of this camera is 23.5mm x 15.7mm, which gives us a pixel size of 0.0438mm.
The estimated focal length is expressed in pixels, so multiplying the result by the pixel size gives us an estimated focal length of 17.8mm, which is consistent with the actual lens we used.

Let us now turn our attention to the distortion parameters. So far, we have mentioned that under the pin-hole camera model, we can neglect the effect of the lens. However, this is only possible if the lens that is used to capture an image does not introduce important optical distortions. Unfortunately, this is not the case with lower quality lenses or with lenses that have a very short focal length. Even the lens we used in this experiment introduced some distortion, that is, the edges of the rectangular board are curved in the image. Note that this distortion becomes more important as we move away from the center of the image. This is a typical distortion observed with a fish-eye lens and is called radial distortion. It is possible to compensate for these deformations by introducing an appropriate distortion model. The idea is to represent the distortions induced by a lens by a set of mathematical equations. Once established, these equations can then be reverted in order to undo the distortions visible on the image. Fortunately, the exact parameters of the transformation, which will correct the distortions, can be obtained together with the other camera parameters during the calibration phase. Once this is done, any image from the newly calibrated camera can be undistorted. Therefore, we have added an additional method to our calibration class.

//remove distortion in an image (after calibration)
cv::Mat CameraCalibrator::remap(const cv::Mat &image) {
  cv::Mat undistorted;
  if (mustInitUndistort) { //called once per calibration
    cv::initUndistortRectifyMap(cameraMatrix, // computed camera matrix
        distCoeffs, // computed distortion matrix
        cv::Mat(), // optional rectification (none)
        cv::Mat(), // camera matrix to generate undistorted
        image.size(), // size of undistorted
        CV_32FC1, // type of output map
        map1, map2); // the x and y mapping functions
    mustInitUndistort = false;
  }
  // Apply mapping functions
  cv::remap(image, undistorted, map1, map2, cv::INTER_LINEAR); // interpolation type
  return undistorted;
}

Running this code on one of our calibration images results in the following undistorted image: To correct the distortion, OpenCV uses a polynomial function that is applied to the image points in order to move them to their undistorted position. By default, five coefficients are used; a model made of eight coefficients is also available. Once these coefficients are obtained, it is possible to compute two cv::Mat mapping functions (one for the x coordinate and one for the y coordinate) that will give the new undistorted position of an image point on a distorted image. This is computed by the cv::initUndistortRectifyMap function, and the cv::remap function remaps all the points of an input image to a new image. Note that because of the nonlinear transformation, some pixels of the input image now fall outside the boundary of the output image. You can expand the size of the output image to compensate for this loss of pixels, but you now obtain output pixels that have no values in the input image (they will then be displayed as black pixels).

There's more...

More options are available when it comes to camera calibration.

Calibration with known intrinsic parameters

When a good estimate of the camera's intrinsic parameters is known, it could be advantageous to input them in the cv::calibrateCamera function.
They will then be used as initial values in the optimization process. To do so, you just need to add the cv::CALIB_USE_INTRINSIC_GUESS flag and input these values in the calibration matrix parameter. It is also possible to impose a fixed value for the principal point (cv::CALIB_FIX_PRINCIPAL_POINT), which can often be assumed to be the central pixel. You can also impose a fixed ratio for the focal lengths fx and fy (cv::CALIB_FIX_RATIO); in this case, you assume that the pixels have a square shape.

Using a grid of circles for calibration

Instead of the usual chessboard pattern, OpenCV also offers the possibility to calibrate a camera by using a grid of circles. In this case, the centers of the circles are used as calibration points. The corresponding function is very similar to the function we used to locate the chessboard corners, for example:

cv::Size boardSize(7,7);
std::vector<cv::Point2f> centers;
bool found = cv::findCirclesGrid(image, boardSize, centers);

See also

The A flexible new technique for camera calibration article by Z. Zhang in IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 22, no 11, 2000, is a classic paper on the problem of camera calibration.

Recovering camera pose

When a camera is calibrated, it becomes possible to relate the captured images with the outside world. If the 3D structure of an object is known, then one can predict how the object will be imaged on the sensor of the camera. The process of image formation is indeed completely described by the projective equation that was presented at the beginning of this article. When most of the terms of this equation are known, it becomes possible to infer the value of the other elements (2D or 3D) through the observation of some images. In this recipe, we will look at the camera pose recovery problem when a known 3D structure is observed.

How to do it...

Let's consider a simple object here, a bench in a park. We took an image of it using the camera/lens system calibrated in the previous recipe. We have manually identified 8 distinct image points on the bench that we will use for our camera pose estimation. Having access to this object makes it possible to make some physical measurements. The bench is composed of a seat of size 242.5cmx53.5cmx9cm and a back of size 242.5cmx24cmx9cm that is fixed 12cm over the seat. Using this information, we can then easily derive the 3D coordinates of the eight identified points in an object-centric reference frame (here we fixed the origin at the left extremity of the intersection between the two planes). We can then create a vector of cv::Point3f containing these coordinates.

//Input object points
std::vector<cv::Point3f> objectPoints;
objectPoints.push_back(cv::Point3f(0, 45, 0));
objectPoints.push_back(cv::Point3f(242.5, 45, 0));
objectPoints.push_back(cv::Point3f(242.5, 21, 0));
objectPoints.push_back(cv::Point3f(0, 21, 0));
objectPoints.push_back(cv::Point3f(0, 9, -9));
objectPoints.push_back(cv::Point3f(242.5, 9, -9));
objectPoints.push_back(cv::Point3f(242.5, 9, 44.5));
objectPoints.push_back(cv::Point3f(0, 9, 44.5));

The question now is where the camera was with respect to these points when the shown picture was taken. Since the coordinates of the image of these known points on the 2D image plane are also known, it becomes easy to answer this question using the cv::solvePnP function.
Here, the correspondence between the 3D and the 2D points has been established manually, but as a reader of this book, you should be able to come up with methods that would allow you to obtain this information automatically.

//Input image points
std::vector<cv::Point2f> imagePoints;
imagePoints.push_back(cv::Point2f(136, 113));
imagePoints.push_back(cv::Point2f(379, 114));
imagePoints.push_back(cv::Point2f(379, 150));
imagePoints.push_back(cv::Point2f(138, 135));
imagePoints.push_back(cv::Point2f(143, 146));
imagePoints.push_back(cv::Point2f(381, 166));
imagePoints.push_back(cv::Point2f(345, 194));
imagePoints.push_back(cv::Point2f(103, 161));
// Get the camera pose from 3D/2D points
cv::Mat rvec, tvec;
cv::solvePnP(objectPoints, imagePoints, // corresponding 3D/2D pts
    cameraMatrix, cameraDistCoeffs, // calibration
    rvec, tvec); // output pose
// Convert to 3D rotation matrix
cv::Mat rotation;
cv::Rodrigues(rvec, rotation);

This function computes the rigid transformation (rotation and translation) that brings the object coordinates into the camera-centric reference frame (that is, the one that has its origin at the focal point). It is also important to note that the rotation computed by this function is given in the form of a 3D vector. This is a compact representation in which the rotation to apply is described by a unit vector (an axis of rotation) around which the object is rotated by a certain angle. This axis-angle representation is also called the Rodrigues' rotation formula. In OpenCV, the angle of rotation corresponds to the norm of the output rotation vector, which is aligned with the axis of rotation. This is why the cv::Rodrigues function is used to obtain the 3D matrix of rotation that appears in our projective equation.

The pose recovery procedure described here is simple, but how do we know we obtained the right camera/object pose information? We can visually assess the quality of the results using the cv::viz module, which gives us the ability to visualize 3D information. The use of this module is explained in the last section of this recipe. For now, let's display a simple 3D representation of our object and the camera that captured it: It might be difficult to judge the quality of the pose recovery just by looking at this image, but if you test the example of this recipe on your computer, you will have the possibility to move this representation in 3D using your mouse, which should give you a better sense of the solution obtained.

How it works...

In this recipe, we assumed that the 3D structure of the object was known, as well as the correspondence between sets of object points and image points. The camera's intrinsic parameters were also known through calibration. If you look at our projective equation presented at the end of the Digital image formation section of the introduction of this article, this means that we have points for which the coordinates (X,Y,Z) and (x,y) are known. We also have the elements of the first matrix known (the intrinsic parameters). Only the second matrix is unknown; this is the one that contains the extrinsic parameters of the camera, that is, the camera/object pose information. Our objective is to recover these unknown parameters from the observation of 3D scene points. This problem is known as the Perspective-n-Point problem or PnP problem. Rotation has three degrees of freedom (for example, the angle of rotation around the three axes) and translation also has three degrees of freedom. We therefore have a total of 6 unknowns.
For each object point/image point correspondence, the projective equation gives us three algebraic equations, but since the projective equation is up to a scale factor, we only have 2 independent equations. A minimum of three points is therefore required to solve this system of equations. Obviously, more points provide a more reliable estimate. In practice, many different algorithms have been proposed to solve this problem, and OpenCV proposes a number of different implementations in its cv::solvePnP function. The default method consists in optimizing what is called the reprojection error. Minimizing this type of error is considered to be the best strategy to get accurate 3D information from camera images. In our problem, it corresponds to finding the optimal camera position that minimizes the 2D distance between the projected 3D points (as obtained by applying the projective equation) and the observed image points given as input.

Note that OpenCV also has a cv::solvePnPRansac function. As the name suggests, this function uses the RANSAC algorithm in order to solve the PnP problem. This means that some of the object point/image point correspondences may be wrong, and the function will return which ones have been identified as outliers. This is very useful when these correspondences have been obtained through an automatic process that can fail for some points.

There's more...

When working with 3D information, it is often difficult to validate the solutions obtained. To this end, OpenCV offers a simple yet powerful visualization module that facilitates the development and debugging of 3D vision algorithms. It allows inserting points, lines, cameras, and other objects in a virtual 3D environment that you can interactively visualize from various points of view.

cv::Viz, a 3D Visualizer module

cv::Viz is an extra module of the OpenCV library that is built on top of the VTK open source library. This Visualization Toolkit (VTK) is a powerful framework used for 3D computer graphics. With cv::viz, you create a 3D virtual environment to which you can add a variety of objects. A visualization window is created that displays the environment from a given point of view. You saw in this recipe an example of what can be displayed in a cv::viz window. This window responds to mouse events that are used to navigate inside the environment (through rotations and translations). This section describes the basic use of the cv::viz module. The first thing to do is to create the visualization window. Here we use a white background:

// Create a viz window
cv::viz::Viz3d visualizer("Viz window");
visualizer.setBackgroundColor(cv::viz::Color::white());

Next, you create your virtual objects and insert them into the scene. There is a variety of predefined objects. One of them is particularly useful for us; it is the one that creates a virtual pin-hole camera:

// Create a virtual camera
cv::viz::WCameraPosition cam(cMatrix, // matrix of intrinsics
    image, // image displayed on the plane
    30.0, // scale factor
    cv::viz::Color::black());
// Add the virtual camera to the environment
visualizer.showWidget("Camera", cam);

The cMatrix variable is a cv::Matx33d (that is, a cv::Matx<double,3,3>) instance containing the intrinsic camera parameters as obtained from calibration. By default, this camera is inserted at the origin of the coordinate system. To represent the bench, we used two rectangular cuboid objects.
// Create a virtual bench from cuboids
cv::viz::WCube plane1(cv::Point3f(0.0, 45.0, 0.0),
    cv::Point3f(242.5, 21.0, -9.0),
    true, // show wire frame
    cv::viz::Color::blue());
plane1.setRenderingProperty(cv::viz::LINE_WIDTH, 4.0);
cv::viz::WCube plane2(cv::Point3f(0.0, 9.0, -9.0),
    cv::Point3f(242.5, 0.0, 44.5),
    true, // show wire frame
    cv::viz::Color::blue());
plane2.setRenderingProperty(cv::viz::LINE_WIDTH, 4.0);
// Add the virtual objects to the environment
visualizer.showWidget("top", plane1);
visualizer.showWidget("bottom", plane2);

This virtual bench is also added at the origin; it then needs to be moved to its camera-centric position, as found from our cv::solvePnP function. It is the responsibility of the setWidgetPose method to perform this operation. It simply applies the rotation and translation components of the estimated motion.

cv::Mat rotation;
// convert vector-3 rotation
// to a 3x3 rotation matrix
cv::Rodrigues(rvec, rotation);
// Move the bench
cv::Affine3d pose(rotation, tvec);
visualizer.setWidgetPose("top", pose);
visualizer.setWidgetPose("bottom", pose);

The final step is to create a loop that keeps displaying the visualization window. The 1ms pause is there to listen to mouse events.

// visualization loop
while(cv::waitKey(100)==-1 && !visualizer.wasStopped()) {
  visualizer.spinOnce(1, // pause 1ms
      true); // redraw
}

This loop will stop when the visualization window is closed or when a key is pressed over an OpenCV image window. Try to apply some motion on an object inside this loop (using setWidgetPose); this is how animation can be created.

See also

Model-based object pose in 25 lines of code by D. DeMenthon and L. S. Davis, in European Conference on Computer Vision, 1992, pp. 335-343, is a famous method for recovering camera pose from scene points.

Summary

This article teaches us how, under specific conditions, the 3D structure of the scene and the 3D pose of the cameras that captured it can be recovered. We have seen how a good understanding of projective geometry concepts allows us to devise methods enabling 3D reconstruction.

Resources for Article: Further resources on this subject: OpenCV: Image Processing using Morphological Filters [article] Learn computer vision applications in Open CV [article] Cardboard is Virtual Reality for Everyone [article]


IoT and Decision Science

Packt
13 Oct 2016
10 min read
In this article by Jojo Moolayil, author of the book Smarter Decisions - The Intersection of Internet of Things and Decision Science, you will learn that the Internet of Things (IoT) and Decision Science have been among the hottest topics in the industry for a while now. You may have heard about IoT and wanted to learn more about it, but unfortunately you would have come across multiple names and definitions over the Internet with hazy differences between them. Also, Decision Science has grown from a nascent domain to become one of the fastest-growing and most widespread horizontals in the industry in recent years. With the ever-increasing volume, variety, and veracity of data, decision science has become more and more valuable for the industry. Using data to uncover latent patterns and insights to solve business problems has made it easier for businesses to take actions with better impact and accuracy. (For more resources related to this topic, see here.) Data is the new oil for the industry, and with the boom of IoT, we are in a world where more and more devices are getting connected to the Internet, with sensors capturing more and more vital, granular details that had never been captured earlier. The IoT is a game changer: with a plethora of devices connected to each other, the industry is eagerly attempting to untap the huge potential that it can deliver. The true value and impact of IoT is delivered with the help of Decision Science. IoT has inherently generated an ocean of data where you can swim to gather insights and take smarter decisions with the intersection of Decision Science and IoT. In this book, you will learn about IoT and Decision Science in detail by solving real-life IoT business problems using a structured approach. In this article, we will begin by understanding the fundamental basics of IoT and Decision Science problem solving. You will learn the following concepts:

Understanding IoT and demystifying Machine to Machine (M2M), IoT, Internet of Everything (IoE), and Industrial IoT (IIoT)
Digging deeper into the logical stack of IoT
Studying the problem life cycle
Exploring the problem landscape
The art of problem solving
The problem solving framework

It is highly recommended that you explore this article in depth. It focuses on the basics and concepts required to build problems and use cases.

Understanding the IoT
To get started with the IoT, let's first try to understand it using the easiest constructs. Internet and Things: we have two simple words here that help us understand the entire concept. So what is the Internet? It is basically a network of computing devices. Similarly, what is a Thing? It could be any real-life entity featuring Internet connectivity. So now, what do we decipher from IoT? It is a network of connected Things that can transmit and receive data from other things once connected to the network. This is how we describe the Internet of Things in a nutshell. Now, let's take a glance at the definition. IoT can be defined as the ever-growing network of Things (entities) that feature Internet connectivity and the communication that occurs between them and other Internet-enabled devices and systems. The Things in IoT are enabled with sensors that capture vital information from the device during its operations, and the device features Internet connectivity that helps it transfer and communicate to other devices and the network.
Today, when we discuss IoT, there are so many other similar terms that come into the picture, such as Industrial Internet, M2M, IoE, and a few more, and we find it difficult to understand the differences between them. Before we begin delineating the differences between these hazy terms and understanding how IoT evolved in the industry, let's first take a simple real-life scenario to understand what exactly IoT looks like.

IoT in a real-life scenario
Let's take a simple example to understand how IoT works. Consider a scenario where you are a father in a family with a working mother and a 10-year-old son studying in school. You and your wife work in different offices. Your house is equipped with quite a few smart devices, say, a smart microwave, smart refrigerator, and smart TV. You are currently in office and you get notified on your smartphone that your son, Josh, has reached home from school. (He used his personal smart key to open the door.) You then use your smartphone to turn on the microwave at home to heat the sandwiches kept in it. Your son gets notified on the smart home controller that you have hot sandwiches ready for him. He quickly finishes them and starts preparing for a math test at school, and you resume your work. After a while, you get notified again that your wife has also reached home (she also uses a similar smart key), and you suddenly realize that you need to reach home to help your son with his math test. You again use your smartphone and change the air conditioner settings for three people and set the refrigerator to defrost using the app. In another 15 minutes, you are home and the air conditioning temperature is well set for three people. You then grab a can of juice from the refrigerator and discuss some math problems with your son on the couch. Intuitive, isn't it? How did this happen, and how did you access and control everything right from your phone? Well, this is how IoT works! Devices can talk to each other and also take actions based on the signals received:
The IoT scenario

Let's take a closer look at the same scenario. You are sitting in office and you could access the air conditioner, microwave, refrigerator, and home controller through your smartphone. Yes, the devices feature Internet connectivity, and once connected to the network, they can send and receive data from other devices and take actions based on signals. A simple protocol helps these devices understand and send data and signals to a plethora of heterogeneous devices connected to the network. We will get into the details of the protocol and how these devices talk to each other soon. However, before that, we will get into some details of how this technology started and why we have so many different names today for IoT.

Demystifying M2M, IoT, IIoT, and IoE
So now that we have a general understanding of what IoT is, let's try to understand how it all started. A few questions that we will try to answer are: Is IoT very new in the market? When did this start? How did this start? What's the difference between M2M, IoT, IoE, and all those different names? And so on. The fundamental idea of IoT, that is, machines or devices connected to each other in a network, isn't really new or radically challenging, so what is this buzz all about? The buzz about machines talking to each other started long before most of us thought of it, and back then it was called Machine to Machine Data.
In the early 1950s, a lot of the machinery deployed for aerospace and military operations required automated communication and remote access for service and maintenance. Telemetry was where it all started. It is a process in which highly automated communication is established so that data can be collected through measurements made at remote or inaccessible geographical areas and then sent to a receiver through a cellular or wired network, where it is monitored for further action. To understand this better, let's take the example of a manned space shuttle sent for space exploration. A huge number of sensors are installed in such a space shuttle to monitor the physical condition of astronauts, the environment, and also the condition of the space shuttle. The data collected through these sensors is then sent back to the substation located on Earth, where a team uses this data to analyze and take further actions. During the same time, the industrial revolution peaked and a huge number of machines were deployed in various industries. Some of these industries, where failures could be catastrophic, also saw the rise of machine-to-machine communication and remote monitoring:
Telemetry
Thus, machine-to-machine data, a.k.a. M2M, was born, mainly through telemetry. Unfortunately, it didn't scale to the extent that it was supposed to, and this was largely because of the time it was developed in. Back then, cellular connectivity was not widespread and affordable, and installing sensors and developing the infrastructure to gather data from them was a very expensive deal. Therefore, only a small chunk of business and military use cases leveraged this. As time passed, a lot of changes happened. The Internet was born and flourished exponentially. The number of devices that got connected to the Internet was colossal. Computing power, storage capacities, and communication and technology infrastructure scaled massively. Additionally, the need to connect devices to other devices evolved, and the cost of setting up infrastructure for this became very affordable and agile. Thus came the IoT. The major difference between M2M and IoT initially was that the latter used the Internet (IPv4/6) as the medium whereas the former used cellular or wired connections for communication. However, this was mainly because of the time they evolved in. Today, heavy engineering industries have machinery deployed that communicates over the IPv4/6 network and is called Industrial IoT or sometimes M2M. The difference between the two is bare minimum, and there are enough cases where both are used interchangeably. Therefore, even though M2M was actually the ancestor of IoT, today both are pretty much the same. M2M and IIoT are nowadays aggressively used to market IoT disruptions in the industrial sector. IoE or Internet of Everything is a term that surfaced in the media and on the Internet very recently. The term was coined by Cisco with a very intuitive definition. It emphasizes humans as one dimension in the ecosystem. It is a more organized way of defining IoT. IoE has logically broken down the IoT ecosystem into smaller components and simplified the ecosystem in an innovative way that was very much essential. IoE divides its ecosystem into four logical units as follows:

People
Processes
Data
Devices

Built on the foundation of IoT, IoE is defined as "the networked connection of People, Data, Processes, and Things."
Overall, all these different terms in the IoT fraternity have more similarities than differences and, at the core, they are the same, that is, devices connecting to each other over a network. The names are then stylized to give a more intrinsic connotation of the business they refer to, such as Industrial IoT and Machine to Machine for (B2B) heavy engineering, manufacturing, and energy verticals, Consumer IoT for the B2C industries, and so on.

Summary
In this article, we learned how to get started with the IoT. The Internet is basically a network of computing devices, and a Thing can be any real-life entity featuring Internet connectivity. The IoT is therefore a network of connected Things that can transmit and receive data from other things once connected to the network. This is how we describe the Internet of Things in a nutshell.

Resources for Article:
Further resources on this subject:
Machine Learning Tasks [article]
Welcome to Machine Learning Using the .NET Framework [article]
Why Big Data in the Financial Sector? [article]


Solving an NLP Problem with Keras, Part 1

Sasank Chilamkurthy
12 Oct 2016
5 min read
In a previous two-part post series on Keras, I introduced Convolutional Neural Networks (CNNs) and the Keras deep learning framework. We used them to solve a Computer Vision (CV) problem involving traffic sign recognition. Now, in this two-part post series, we will solve a Natural Language Processing (NLP) problem with Keras. Let's begin.

The Problem and the Dataset
The problem we are going to tackle is Natural Language Understanding. The aim is to extract the meaning of speech utterances. This is still an unsolved problem in general, so we break it down into the solvable, practical problem of understanding the speaker in a limited context. In particular, we want to identify the intent of a speaker asking for information about flights. The dataset we are going to use is Airline Travel Information System (ATIS). This dataset was collected by DARPA in the early 90s. ATIS consists of spoken queries on flight-related information. An example utterance is "I want to go from Boston to Atlanta on Monday". Understanding this is then reduced to identifying arguments like Destination and Departure Day. This task is called slot-filling. Here is an example sentence and its labels. You will observe that labels are encoded in an Inside Outside Beginning (IOB) representation. Let's look at the dataset:

|Words | Show | flights | from | Boston | to | New   | York  | today  |
|Labels| O    | O       | O    | B-dept | O  | B-arr | I-arr | B-date |

The ATIS official split contains 4,978/893 sentences for a total of 56,590/9,198 words (average sentence length is 15) in the train/test set. The number of classes (different slots) is 128, including the O label (NULL). Unseen words in the test set are encoded by the <UNK> token, and each digit is replaced with the string DIGIT; that is, 20 is converted to DIGITDIGIT. Our approach to the problem is to use:

Word embeddings
Recurrent neural networks

I'll talk about these briefly in the following sections.

Word Embeddings
Word embeddings map words to vectors in a high-dimensional space. These word embeddings can actually learn the semantic and syntactic information of words. For instance, they can understand that similar words are close to each other in this space and dissimilar words are far apart. This can be learned either using large amounts of text like Wikipedia, or specifically for a given problem. We will take the second approach for this problem. As an illustration, I have shown here the nearest neighbors in the word embedding space for some of the words. This embedding space was learned by the model that we'll define later in the post:

sunday     delta        california    boston       august     time       car
wednesday  continental  colorado      nashville    september  schedule   rental
saturday   united       florida       toronto      july       times      limousine
friday     american     ohio          chicago      june       schedules  rentals
monday     eastern      georgia       phoenix      december   dinnertime cars
tuesday    northwest    pennsylvania  cleveland    november   ord        taxi
thursday   us           north         atlanta      april      f28        train
wednesdays nationair    tennessee     milwaukee    october    limo       limo
saturdays  lufthansa    minnesota     columbus     january    departure  ap
sundays    midwest      michigan      minneapolis  may        sfo        later
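The neighbors above are simply the words whose embedding vectors lie closest to each query word. As a rough illustration of how such neighbors can be computed from any embedding matrix (this is not the post's actual code; the vocabulary and vectors below are made-up placeholders):

# Hypothetical sketch: nearest neighbors by cosine similarity in an embedding space
import numpy as np

vocab = ["sunday", "monday", "saturday", "boston", "atlanta", "delta", "united"]
embeddings = np.random.rand(len(vocab), 100)   # one 100-d vector per word (random here)

def nearest(word, k=3):
    idx = vocab.index(word)
    # normalize so that dot products become cosine similarities
    vecs = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sims = vecs @ vecs[idx]
    order = np.argsort(-sims)
    return [vocab[i] for i in order if i != idx][:k]

print(nearest("sunday"))

With real, trained embeddings, a call like nearest("sunday") would return other days of the week, as in the table above.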
Recurrent Neural Networks
Convolutional layers can be a great way to pool local information, but they do not really capture the sequentiality of data. Recurrent Neural Networks (RNNs) help us tackle sequential information like natural language. If we are going to predict properties of the current word, we had better remember the words before it too. An RNN has an internal state/memory that stores a summary of the sequence it has seen so far. This allows us to use RNNs to solve complicated word tagging problems such as Part Of Speech (POS) tagging or slot filling, as in our case. The following diagram illustrates the internals of an RNN (source: Nature). Let's briefly go through it:

x_1, x_2, ..., x_(t-1), x_t, x_(t+1), ... are the inputs to the RNN.
s_t is the hidden state of the RNN at step t. It is computed from the state at step t-1 as s_t = f(U x_t + W s_(t-1)), where f is a nonlinearity such as tanh or ReLU.
o_t is the output at step t, computed as o_t = f(V s_t).
U, V, and W are the learnable parameters of the RNN.

For our problem, we will pass a sequence of word embeddings as the input to the RNN. A small NumPy sketch of this recurrence is given at the end of this post.

Putting it all together
Now that we've set up the problem and have an understanding of the basic blocks, let's code it up. Since we are using the IOB representation for labels, it's not simple to calculate the scores of our model. We therefore use the conlleval perl script to compute the F1 scores. I've adapted the code from here for the data preprocessing and score calculation. The complete code is available at GitHub:

$ git clone https://github.com/chsasank/ATIS.keras.git
$ cd ATIS.keras

I recommend using jupyter notebook to run and experiment with the snippets from the tutorial.

$ jupyter notebook

Conclusion
In part 2, we will load the data using data.load.atisfull(). We will also define the Keras model, and then we will train the model. To measure the accuracy of the model, we'll use model.predict_on_batch() and metrics.accuracy.conlleval(). And finally, we will improve our model to achieve better results.

About the author
Sasank Chilamkurthy works at Fractal Analytics. His work involves deep learning on medical images obtained from radiology and pathology. He is mainly interested in computer vision.
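As promised above, here is a minimal NumPy sketch of the RNN recurrence s_t = f(U x_t + W s_(t-1)), o_t = f(V s_t). All dimensions and weights below are arbitrary illustrations; the Keras model built in Part 2 implements this internally:

# Hypothetical forward pass of a simple RNN over a sequence of word embeddings
import numpy as np

input_dim, hidden_dim, output_dim = 50, 100, 128   # embedding size, state size, number of slot labels
U = np.random.randn(hidden_dim, input_dim) * 0.01
W = np.random.randn(hidden_dim, hidden_dim) * 0.01
V = np.random.randn(output_dim, hidden_dim) * 0.01

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def rnn_forward(x_sequence):
    s = np.zeros(hidden_dim)               # initial hidden state (the RNN's memory)
    outputs = []
    for x_t in x_sequence:
        s = np.tanh(U @ x_t + W @ s)       # update the internal state
        outputs.append(softmax(V @ s))     # label probabilities for the current word
    return outputs

sentence = [np.random.randn(input_dim) for _ in range(8)]   # 8 stand-in word embeddings
print(len(rnn_forward(sentence)), "label distributions, one per word")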

Basics of Image Histograms in OpenCV

Packt
12 Oct 2016
11 min read
In this article by Samyak Datta, author of the book Learning OpenCV 3 Application Development, we are going to focus our attention on a different style of processing pixel values. The output of the techniques that comprise our study in the current article will not be images, but other forms of representation for images, namely image histograms. We have seen that a two-dimensional grid of intensity values is one of the default forms of representing images in digital systems for processing as well as storage. However, such representations are not at all easy to scale. So, for an image with a reasonably low spatial resolution, say 512 x 512 pixels, working with a two-dimensional grid might not pose any serious issues. However, as the dimensions increase, the corresponding increase in the size of the grid may start to adversely affect the performance of the algorithms that work with the images. A primary advantage that an image histogram has to offer is that the size of a histogram is a constant that is independent of the dimensions of the image. As a consequence of this, we are guaranteed that irrespective of the spatial resolution of the images that we are dealing with, the algorithms that power our solutions will have to deal with a constant amount of data if they are working with image histograms. (For more resources related to this topic, see here.) Each descriptor captures some particular aspects or features of the image to construct its own form of representation. One of the common pitfalls of using histograms as a form of image representation, as compared to the native form of using the entire two-dimensional grid of values, is loss of information. A full-fledged image representation using pixel intensity values for all pixel locations naturally consists of all the information that you would need to reconstruct a digital image. However, the same cannot be said about histograms. When we study image histograms in detail, we'll get to see exactly what information we stand to lose. And this loss of information is prevalent across all forms of image descriptors.

The basics of histograms
At the outset, we will briefly explain the concept of a histogram. Most of you might already know this from your lessons on basic statistics. However, we will reiterate it for the sake of completeness. A histogram is a form of data representation that relies on an aggregation of data points. The data is aggregated into a set of predefined bins that are represented along the x axis, and the number of data points that fall within each of the bins makes up the corresponding count on the y axis. For example, let's assume that our data looks something like the following:

D = {2, 7, 1, 5, 6, 9, 14, 11, 8, 10, 13}

If we define three bins, namely Bin_1 (1 - 5), Bin_2 (6 - 10), and Bin_3 (11 - 15), then the histogram corresponding to our data would look something like this:

Bins             Frequency
Bin_1 (1 - 5)    3
Bin_2 (6 - 10)   5
Bin_3 (11 - 15)  3

What this histogram tells us is that we have three values between 1 and 5, five between 6 and 10, and three again between 11 and 15. Note that it doesn't tell us what the values are, just that some n values exist in a given bin. A more familiar visual representation of this histogram is a bar plot with the bins plotted along the x axis and their corresponding frequencies along the y axis. Now, in the context of images, how is a histogram computed? Well, it's not that difficult to deduce.
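The toy example can be verified quickly with NumPy (shown purely as an illustration; the article's own implementation follows in C++):

import numpy as np

D = [2, 7, 1, 5, 6, 9, 14, 11, 8, 10, 13]
# bin edges chosen so that the integer ranges 1-5, 6-10, and 11-15 map to the three bins
counts, edges = np.histogram(D, bins=[1, 6, 11, 16])
print(counts)   # [3 5 3], matching the table above

With that quick check done, let's turn to image histograms.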
Since the data that we have comprise pixel intensity values, an image histogram is computed by plotting a histogram using the intensity values of all its constituent pixels. What this essentially means is that the sequence of pixel intensity values in our image becomes the data. Well, this is in fact the simplest kind of histogram that you can compute using the information available to you from the image. Now, coming back to image histograms, there are some basic terminologies (pertaining to histograms in general) that you need to be aware of before you can dip your hands into code. We have explained them in detail here: Histogram size: The histogram size refers to the number of bins in the histogram. Range: The range of a histogram is the range of data that we are dealing with. The range of data as well as the histogram size are both important parameters that define a histogram. Dimensions: Simply put, dimensions refer to the number of the type of items whose values we aggregate in the histogram bins. For example, consider a grayscale image. We might want to construct a histogram using the pixel intensity values for such an image. This would be an example of a single-dimensional histogram because we are just interested in aggregating the pixel intensity values and nothing else. The data, in this case, is spread over a range of 0 to 255. On account of being one-dimensional, such histograms can be represented graphically as 2D plots—one-dimensional data (pixel intensity values) being plotted on the x axis (in the form of bins) along with the corresponding frequency counts along the y axis. We have already seen an example of this before. Now, imagine a color image with three channels: red, green, and blue. Let's say that we want to plot a histogram for the intensities in the red and green channels combined. This means that our data now becomes a pair of values (r, g). A histogram that is plotted for such data will have a dimensionality of 2. The plot for such a histogram will be a 3D plot with the data bins covering the x and y axes and the frequency counts plotted along the z axis. Now that we have discussed the theoretical aspects of image histograms in detail, let's start thinking along the lines of code. We will start with the simplest (and in fact the most ubiquitous) design of image histograms. The range of our data will be from 0 to 255 (both inclusive), which means that all our data points will be integers that fall within the specified range. Also, the number of data points will equal the number of pixels that make up our input image. The simplicity in design comes from the fact that we fix the size of the histogram (the number of bins) as 256. Now, take a moment to think about what this means. There are 256 different possible values that our data points can take and we have a separate bin corresponding to each one of those values. So such an image histogram will essentially depict the 256 possible intensity values along with the counts of the number of pixels in the image that are colored with each of the different intensities. Before taking a peek at what OpenCV has to offer, let's try to implement such a histogram on our own! We define a function named computeHistogram() that takes the grayscale image as an input argument and returns the image histogram. From our earlier discussions, it is evident that the histogram must contain 256 entries (for the 256 bins): one for each integer between 0 and 255. 
The value stored in the histogram corresponding to each of the 256 entries will be the count of the image pixels that have a particular intensity value. So, conceptually, we can use an array for our implementation such that the value stored in histogram[i] (for 0 ≤ i ≤ 255) will be the count of the number of pixels in the image having the intensity of i. However, instead of using a C++ array, we will comply with the rules and standards followed by OpenCV and represent the histogram as a Mat object. We have already seen that a Mat object is nothing but a multidimensional array store. The implementation is outlined in the following code snippet:

Mat computeHistogram(Mat input_image) {
    Mat histogram = Mat::zeros(256, 1, CV_32S);
    for (int i = 0; i < input_image.rows; ++i) {
        for (int j = 0; j < input_image.cols; ++j) {
            int binIdx = (int) input_image.at<uchar>(i, j);
            histogram.at<int>(binIdx, 0) += 1;
        }
    }
    return histogram;
}

As you can see, we have chosen to represent the histogram as a 256-element-column-vector Mat object. We iterate over all the pixels in the input image and keep on incrementing the corresponding counts in the histogram (which had been initialized to 0). As per our description of the image histogram properties, it is easy to see that the intensity value of any pixel is the same as the bin index that is used to index into the appropriate histogram bin to increment the count. Having such an implementation ready, let's test it out with the help of an actual image. The following code demonstrates a main() function that reads an input image, calls the computeHistogram() function that we have defined just now, and displays the contents of the histogram that is returned as a result:

int main() {
    Mat input_image = imread("/home/samyak/Pictures/lena.jpg", IMREAD_GRAYSCALE);
    Mat histogram = computeHistogram(input_image);
    cout << "Histogram...\n";
    for (int i = 0; i < histogram.rows; ++i)
        cout << i << " : " << histogram.at<int>(i, 0) << "\n";
    return 0;
}

We have used the fact that the histogram that is returned from the function will be a single-column Mat object. This makes the code that displays the contents of the histogram much cleaner.

Histograms in OpenCV
We have just seen the implementation of a very basic and minimalistic histogram using the first principles in OpenCV. The image histogram was basic in the sense that all the bins were uniform in size and comprised only a single pixel intensity. This made our lives simple when we designed our code for the implementation; there wasn't any need to explicitly check the membership of a data point (the intensity value of a pixel) with all the bins of our histograms. However, we know that a histogram can have bins whose sizes span more than one. Can you think of the changes that we might need to make in the code that we had written just now to accommodate bin sizes larger than 1? If this change seems doable to you, try to figure out how to incorporate the possibility of non-uniform bin sizes or multidimensional histograms. By now, things might have started to get a little overwhelming to you. No need to worry. As always, OpenCV has you covered! The developers at OpenCV have provided you with a calcHist() function whose sole purpose is to calculate the histograms for a given set of arrays.
By arrays, we refer to the images represented as Mat objects, and we use the term set because the function has the capability to compute multidimensional histograms from the given data:

Mat computeHistogram(Mat input_image) {
    Mat histogram;
    int channels[] = { 0 };
    int histSize[] = { 256 };
    float range[] = { 0, 256 };
    const float* ranges[] = { range };
    calcHist(&input_image, 1, channels, Mat(), histogram, 1, histSize, ranges, true, false);
    return histogram;
}

Before we move on to an explanation of the different parameters involved in the calcHist() function call, I want to bring your attention to the abundant use of arrays in the preceding code snippet. Even arguments as simple as histogram sizes are passed to the function in the form of arrays rather than integer values, which at first glance seems quite unnecessary and counter-intuitive. The usage of arrays is due to the fact that the implementation of calcHist() is equipped to handle multidimensional histograms as well, and when we are dealing with such multidimensional histogram data, we require multiple parameters to be passed, one for each dimension. This would become clearer once we demonstrate an example of calculating multidimensional histograms using the calcHist() function. For the time being, we just wanted to clear the immediate confusion that might have popped up in your minds upon seeing the array parameters. Here is a detailed list of the arguments in the calcHist() function call:

Source images
Number of source images
Channel indices
Mask
Dimensions (dims)
Histogram size
Ranges
Uniform flag
Accumulate flag

The last couple of arguments (the uniform and accumulate flags) have default values of true and false, respectively. Hence, the function call that you have seen just now can very well be written as follows:

calcHist(&input_image, 1, channels, Mat(), histogram, 1, histSize, ranges);

Summary
Thus, in this article, we have studied the fundamentals of using histograms in OpenCV for image processing.

Resources for Article:
Further resources on this subject:
Remote Sensing and Histogram [article]
OpenCV: Image Processing using Morphological Filters [article]
Learn computer vision applications in Open CV [article]
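For readers working from Python rather than C++, the calcHist() call described above maps to cv2.calcHist in OpenCV's Python bindings. The following is a brief sketch only; the image path is a hypothetical placeholder:

import cv2

input_image = cv2.imread("lena.jpg", cv2.IMREAD_GRAYSCALE)   # placeholder path
histogram = cv2.calcHist([input_image],   # list of source images
                         [0],             # channel index
                         None,            # no mask
                         [256],           # histogram size (number of bins)
                         [0, 256])        # range of the pixel values
print(histogram.shape)                    # (256, 1): one count per intensity value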


Thinking Probabilistically

Packt
04 Oct 2016
16 min read
In this article by Osvaldo Martin, the author of the book Bayesian Analysis with Python, we will learn that Bayesian statistics has been developing for more than 250 years now. During this time, it has enjoyed as much recognition and appreciation as disdain and contempt. In the last few decades, it has gained an increasing amount of attention from people in the field of statistics and almost all the other sciences, engineering, and even outside the walls of the academic world. This revival has been possible due to theoretical and computational developments; modern Bayesian statistics is mostly computational statistics. The necessity for flexible and transparent models and a more intuitive interpretation of the results of a statistical analysis has only contributed to the trend. (For more resources related to this topic, see here.) Here, we will adopt a pragmatic approach to Bayesian statistics and we will not care too much about other statistical paradigms and their relationship with Bayesian statistics. The aim of this book is to learn how to do Bayesian statistics with Python; philosophical discussions are interesting but they have already been discussed elsewhere in a much richer way than we could discuss in these pages. We will use a computational and modeling approach, and we will learn to think in terms of probabilistic models and apply Bayes' theorem to derive the logical consequences of our models and data. Models will be coded using Python and PyMC3, a great library for Bayesian statistics that hides most of the mathematical details of Bayesian analysis from the user. Bayesian statistics is theoretically grounded in probability theory, and hence it is no wonder that many books about Bayesian statistics are full of mathematical formulas requiring a certain level of mathematical sophistication. Nevertheless, programming allows us to learn and do Bayesian statistics with only modest mathematical knowledge. This is not to say that learning the mathematical foundations of statistics is useless; don't get me wrong, that could certainly help you build better models and gain an understanding of problems, models, and results. In this article, we will cover the following topics:

Statistical modeling
Probabilities and uncertainty

Statistical modeling
Statistics is about collecting, organizing, analyzing, and interpreting data, and hence statistical knowledge is essential for data analysis. Another useful skill when analyzing data is knowing how to write code in a programming language such as Python. Manipulating data is usually necessary given that we live in a messy world with even messier data, and coding helps to get things done. Even if your data is clean and tidy, programming will still be very useful since, as we will see, modern Bayesian statistics is mostly computational statistics. Most introductory statistical courses, at least for non-statisticians, are taught as a collection of recipes that more or less go like this: go to the statistical pantry, pick one can and open it, add data to taste, and stir until obtaining a consistent p-value, preferably under 0.05 (if you don't know what a p-value is, don't worry; we will not use them in this book). The main goal in this type of course is to teach you how to pick the proper can. We will take a different approach: we will also learn some recipes, but this will be home-made food rather than canned food; we will learn how to mix fresh ingredients that will suit different gastronomic occasions.
But before we can cook, we must learn some statistical vocabulary and also some concepts.

Exploratory data analysis
Data is an essential ingredient of statistics. Data comes from several sources, such as experiments, computer simulations, surveys, field observations, and so on. If we are the ones who will be generating or gathering the data, it is always a good idea to first think carefully about the questions we want to answer and which methods we will use, and only then proceed to get the data. In fact, there is a whole branch of statistics dealing with data collection, known as experimental design. In the era of data deluge, we can sometimes forget that getting data is not always cheap. For example, while it is true that the Large Hadron Collider (LHC) produces hundreds of terabytes a day, its construction took years of manual and intellectual effort. In this book we will assume that we have already collected the data and also that the data is clean and tidy, something rarely true in the real world. We will make these assumptions in order to focus on the subject of this book. If you want to learn how to use Python for cleaning and manipulating data and also want a primer on statistics and machine learning, you should probably read Python Data Science Handbook by Jake VanderPlas. OK, so let's assume we have our dataset; usually, a good idea is to explore and visualize it in order to get some idea of what we have in our hands. This can be achieved through what is known as Exploratory Data Analysis (EDA), which basically consists of the following:

Descriptive statistics
Data visualization

The first one, descriptive statistics, is about how to use some measures (or statistics) to summarize or characterize the data in a quantitative manner. You probably already know that you can describe data using the mean, mode, standard deviation, interquartile ranges, and so forth. The second one, data visualization, is about visually inspecting the data; you are probably familiar with representations such as histograms, scatter plots, and others. While EDA was originally thought of as something you apply to data before doing any complex analysis, or even as an alternative to complex model-based analysis, through the book we will learn that EDA is also applicable to understanding, interpreting, checking, summarizing, and communicating the results of Bayesian analysis.

Inferential statistics
Sometimes, plotting our data and computing simple numbers, such as the average of our data, is all we need. Other times, we want to go beyond our data to understand the underlying mechanism that could have generated the data, or maybe we want to make predictions for future data, or we need to choose among several competing explanations for the same data. That's the job of inferential statistics. To do inferential statistics, we will rely on probabilistic models. There are many types of models, and most of science, and I would add all of our understanding of the real world, works through models. The brain is just a machine that models reality (whatever reality might be): http://www.tedxriodelaplata.org/videos/m%C3%A1quina-construye-realidad. What are models? Models are simplified descriptions of a given system (or process).
Those descriptions are purposely designed to capture only the most relevant aspects of the system, and hence, most models do not try to pretend they are able to explain everything; on the contrary, if we have a simple and a complex model and both models explain the data well, we will generally prefer the simpler one. Model building, no matter which type of model you are building, is an iterative process following more or less the same basic rules. We can summarize the Bayesian modeling process using three steps: Given some data and some assumptions on how this data could have been generated, we will build models. Most of the time, models will be crude approximations, but most of the time this is all we need. Then we will use Bayes' theorem to add data to our models and derive the logical consequences of mixing the data and our assumptions. We say we are conditioning the model on our data. Lastly, we will check that the model makes sense according to different criteria, including our data and our expertise on the subject we are studying. In general, we will find ourselves performing these three steps in a non-linear iterative fashion. Sometimes we will retrace our steps at any given point: maybe we made a silly programming mistake, maybe we found a way to change the model and improve it, maybe we need to add more data. Bayesian models are also known as probabilistic models because they are built using probabilities. Why probabilities? Because probabilities are the correct mathematical tool for dealing with uncertainty in our data and models, so let's take a walk through the garden of forking paths. Probabilities and uncertainty While probability theory is a mature and well-established branch of mathematics, there is more than one interpretation of what probabilities are. To a Bayesian, a probability is a measure that quantifies the uncertainty level of a statement. If we know nothing about coins and we do not have any data about coin tosses, it is reasonable to think that the probability of a coin landing heads could take any value between 0 and 1; that is, in the absence of information, all values are equally likely, our uncertainty is maximum. If we know instead that coins tend to be balanced, then we may say that the probability of acoin landing is exactly 0.5 or may be around 0.5 if we admit that the balance is not perfect. If we collect data, we can update these prior assumptions and hopefully reduce the uncertainty about the bias of the coin. Under this definition of probability, it is totally valid and natural to ask about the probability of life on Mars, the probability of the mass of the electron being 9.1 x 10-31 kg, or the probability of the 9th of July of 1816 being a sunny day. Notice for example that life on Mars exists or not; it is a binary outcome, but what we are really asking is how likely is it to find life on Mars given our data and what we know about biology and the physical conditions on that planet? The statement is about our state of knowledge and not, directly, about a property of nature. We are using probabilities because we can not be sure about the events, not because the events are necessarily random. Since this definition of probability is about our epistemic state of mind, sometimes it is referred to as the subjective definition of probability, explaining the slogan of subjective statistics often attached to the Bayesian paradigm. 
Nevertheless, this definition does not mean all statements should be treated as equally valid and so anything goes; this definition is about acknowledging that our understanding of the world is imperfect and conditioned by the data and models we have made. There is no such thing as a model-free or theory-free understanding of the world; even if it were possible to free ourselves from our social preconditioning, we would end up with a biological limitation: our brain, subject to the evolutionary process, has been wired with models of the world. We are doomed to think like humans and we will never think like bats or anything else! Moreover, the universe is an uncertain place and all we can do is make probabilistic statements about it. Notice that it does not matter whether the underlying reality of the world is deterministic or stochastic; we are using probability as a tool to quantify uncertainty. Logic is about thinking without making mistakes. In Aristotelian or classical logic, we can only have statements that are true or false. In the Bayesian definition of probability, certainty is just a special case: a true statement has a probability of 1, and a false one has a probability of 0. We would assign a probability of 1 to the statement that there is life on Mars only after having conclusive data indicating something is growing and reproducing and doing other activities we associate with living organisms. Notice, however, that assigning a probability of 0 is harder because we can always think that there is some Martian spot that is unexplored, or that we have made mistakes with some experiment, or several other reasons that could lead us to falsely believe life is absent on Mars when it is not. Interestingly enough, Cox mathematically proved that if we want to extend logic to contemplate uncertainty, we must use probabilities and probability theory, from which Bayes' theorem is just a logical consequence, as we will see soon. Hence, another way of thinking about Bayesian statistics is as an extension of logic when dealing with uncertainty, something that clearly has nothing to do with subjective reasoning in the pejorative sense. Now that we know the Bayesian interpretation of probability, let's see some of the mathematical properties of probabilities. For a more detailed study of probability theory, you can read Introduction to Probability by Joseph K. Blitzstein and Jessica Hwang. Probabilities are numbers in the interval [0, 1], that is, numbers between 0 and 1, including both extremes. Probabilities follow some rules; one of these rules is the product rule:

p(A, B) = p(A|B) p(B)

We read this as follows: the probability of A and B is equal to the probability of A given B, multiplied by the probability of B. The expression p(A|B) is used to indicate a conditional probability; the name refers to the fact that the probability of A is conditioned by knowing B. For example, the probability that the pavement is wet is different from the probability that the pavement is wet given that it is raining. A conditional probability can be larger than, smaller than, or equal to the unconditioned probability. If knowing B does not provide us with information about A, then p(A|B) = p(A); that is, A and B are independent of each other. On the contrary, if knowing B gives us useful information about A, then p(A|B) will differ from p(A). Conditional probabilities are a key concept in statistics, and understanding them is crucial to understanding Bayes' theorem, as we will see soon.
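A quick numerical illustration of the product rule (the numbers here are invented just for this example):

# Toy check of p(A, B) = p(A|B) * p(B) with made-up values
# A = "the pavement is wet", B = "it is raining"
p_A_and_B = 0.27                 # assumed joint probability
p_B = 0.30                       # assumed probability of rain
p_A_given_B = p_A_and_B / p_B
print(p_A_given_B)               # approximately 0.9
print(p_A_given_B * p_B)         # approximately 0.27, recovering p(A, B) as the product rule states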
Let's try to understand them from a different perspective. If we reorder the equation for the product rule, we get the following:

p(A|B) = p(A, B) / p(B)

Hence, p(A|B) is the probability that both A and B happen, relative to the probability of B happening. Why do we divide by p(B)? Knowing B is equivalent to saying that we have restricted the space of possible events to B, and thus, to find the conditional probability, we take the favorable cases and divide them by the total number of events. It is important to realize that all probabilities are indeed conditional; there is no such thing as an absolute probability floating in a vacuum. There is always some model, assumption, or condition, even if we don't notice or know them. The probability of rain is not the same if we are talking about Earth, Mars, or some other place in the Universe, in the same way that the probability of a coin landing heads or tails depends on our assumptions about the coin being biased in one way or another. Now that we are more familiar with the concept of probability, let's jump to the next topic, probability distributions.

Probability distributions
A probability distribution is a mathematical object that describes how likely different events are. In general, these events are restricted somehow to a set of possible events. A common and useful conceptualization in statistics is to think that data was generated from some probability distribution with unobserved parameters. Since the parameters are unobserved and we only have data, we will use Bayes' theorem to invert the relationship, that is, to go from the data to the parameters. Probability distributions are the building blocks of Bayesian models; by combining them in proper ways we can get useful complex models. We will meet several probability distributions throughout the book; every time we discover one, we will take a moment to try to understand it. Probably the most famous of all of them is the Gaussian or normal distribution. A variable x follows a Gaussian distribution if its values are dictated by the following formula:

p(x) = 1 / (sigma * sqrt(2 * pi)) * exp(-(x - mu)^2 / (2 * sigma^2))

In the formula, mu and sigma are the parameters of the distribution. The first one can take any real value and dictates the mean of the distribution (and also the median and mode, which are all equal). The second one is the standard deviation, which can only be positive and dictates the spread of the distribution. Since there is an infinite number of possible combinations of mu and sigma values, there is an infinite number of instances of the Gaussian distribution, and all of them belong to the same Gaussian family. Mathematical formulas are concise and unambiguous, and some people say even beautiful, but we must admit that meeting them can be intimidating; a good way to break the ice is to use Python to explore them. Let's see what the Gaussian distribution family looks like:

import matplotlib.pyplot as plt
import numpy as np
from scipy import stats
import seaborn as sns

mu_params = [-1, 0, 1]
sd_params = [0.5, 1, 1.5]
x = np.linspace(-7, 7, 100)
f, ax = plt.subplots(len(mu_params), len(sd_params), sharex=True, sharey=True)
for i in range(3):
    for j in range(3):
        mu = mu_params[i]
        sd = sd_params[j]
        y = stats.norm(mu, sd).pdf(x)
        ax[i,j].plot(x, y)
        ax[i,j].set_ylim(0, 1)
        ax[i,j].plot(0, 0, label="$\\mu$ = {:3.2f}\n$\\sigma$ = {:3.2f}".format(mu, sd), alpha=0)
        ax[i,j].legend()
ax[2,1].set_xlabel('$x$')
ax[1,0].set_ylabel('$pdf(x)$')

The output of the preceding code is a 3 x 3 grid of Gaussian curves, one for each combination of mu and sd. A variable, such as x, that comes from a probability distribution is called a random variable. It is not that the variable can take any possible value.
On the contrary, the values are strictly dictated by the probability distribution; the randomness arises from the fact that we cannot predict which value the variable will take, only the probability of observing those values. A common notation used to say that a variable is distributed as a Gaussian or normal distribution with parameters mu and sigma is as follows:

x ~ Normal(mu, sigma)

The symbol ~ is read as "is distributed as". There are two types of random variables: continuous and discrete. Continuous variables can take any value from some interval (we can use Python floats to represent them), and discrete variables can take only certain values (we can use Python integers to represent them). Many models assume that successive values of a random variable are all sampled from the same distribution and that those values are independent of each other. In such a case, we say that the variables are independently and identically distributed, or iid variables for short. Using mathematical notation, we can see that two variables are independent if, for every value of x and y:

p(x, y) = p(x) p(y)

A common example of non-iid variables is time series, where the temporal dependency of the random variable is a key feature that should be taken into account.

Summary
In this article, we took a practical approach to Bayesian statistics and saw how it can be implemented with Python. We learned to think of problems in terms of probability and uncertainty and to apply Bayes' theorem to derive results from them.

Resources for Article:
Further resources on this subject:
Python Data Science Up and Running [article]
Mining Twitter with Python – Influence and Engagement [article]
Exception Handling in MySQL for Python [article]
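As a closing illustration, the notation x ~ Normal(mu, sigma) translates directly into code when we draw iid samples with SciPy; the parameter values below are arbitrary:

from scipy import stats

mu, sd = 0.0, 1.0
x = stats.norm(mu, sd).rvs(size=5, random_state=42)
print(x)   # five independent draws from the same Gaussian distribution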


Supervised Machine Learning

Packt
04 Oct 2016
13 min read
In this article by Anshul Joshi, the author of the book Julia for Data Science, we will learn that data science involves understanding data, gathering data, munging data, taking the meaning out of that data, and then machine learning if needed. Julia is a high-level, high-performance dynamic programming language for technical computing, with syntax that is familiar to users of other technical computing environments. (For more resources related to this topic, see here.) The key features offered by Julia are: A general purpose high-level dynamic programming language designed to be effective for numerical and scientific computing A Low-Level Virtual Machine (LLVM) based Just-in-Time (JIT) compiler that enables Julia to approach the performance of statically-compiled languages like C/C++ What is machine learning? Generally, when we talk about machine learning, we get into the idea of us fighting wars with intelligent machines that we created but went out of control. These machines are able to outsmart the human race and become a threat to human existence. These theories are nothing but created for our entertainment. We are still very far away from such machines. So, the question is: what is machine learning? Tom M. Mitchell gave a formal definition- "A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P if its performance at tasks in T, as measured by P, improves with experience E." It says that machine learning is teaching computers to generate algorithms using data without programming them explicitly. It transforms data into actionable knowledge. Machine learning has close association with statistics, probability, and mathematical optimization. As technology grew, there is one thing that grew with it exponentially—data. We have huge amounts of unstructured and structured data growing at a very great pace. Lots of data is generated by space observatories, meteorologists, biologists, fitness sensors, surveys, and so on. It is not possible to manually go through this much amount of data and find patterns or gain insights. This data is very important for scientists, domain experts, governments, health officials, and even businesses. To gain knowledge out of this data, we need self-learning algorithms that can help us in decision making. Machine learning evolved as a subfield of artificial intelligence, which eliminates the need to manually analyze large amounts of data. Instead of using machine learning, we make data-driven decisions by gaining knowledge using self-learning predictive models. Machine learning has become important in our daily lives. Some common use cases include search engines, games, spam filters, and image recognition. Self-driving cars also use machine learning. Some basic terminologies used in machine learning: Features: Distinctive characteristics of the data point or record Training set: This is the dataset that we feed to train the algorithm that helps us to find relationships or build a model Testing set: The algorithm generated using the training dataset is tested on the testing dataset to find the accuracy Feature vector: An n-dimensional vector that contains the features defining an object Sample: An item from the dataset or the record Uses of machine learning Machine learning in one way or another is used everywhere. Its applications are endless. 
Let's discuss some very common use cases:

E-mail spam filtering: Every major e-mail service provider uses machine learning to filter out spam messages from the Inbox to the Spam folder.
Predicting storms and natural disasters: Machine learning is used by meteorologists and geologists to predict natural disasters using weather data, which can help us to take preventive measures.
Targeted promotions/campaigns and advertising: On social sites, search engines, and maybe in our mailboxes, we see advertisements that somehow suit our taste. This is made feasible using machine learning on the data from our past searches, our social profile, or the e-mail contents.
Self-driving cars: Technology giants are currently working on self-driving cars. This is made possible using machine learning on the feed of actual data from human drivers, image and sound processing, and various other factors.

Machine learning is also used by businesses to predict the market. It can also be used to predict the outcomes of elections and the sentiment of voters towards a particular candidate. Machine learning is also being used to prevent crime. By understanding the patterns of different criminals, we can predict crimes that may happen in the future and prevent them. One case that got a huge amount of attention was that of a big retail chain in the United States using machine learning to identify pregnant women. The retailer's strategy was to give discounts on multiple maternity products so that these customers would become loyal and would go on to purchase baby items, which have a high profit margin. The retailer built an algorithm to predict pregnancy using patterns in the purchases of products that are useful for pregnant women. Once, a man approached the retailer and asked why his teenage daughter was receiving discount coupons for maternity items. The retail chain offered an apology, but later the father himself apologized when he got to know that his daughter was indeed pregnant. This story may or may not be completely true, but retailers do routinely analyze their customers' data to find patterns for targeted promotions, campaigns, and inventory management.

Machine learning and ethics
Let's see where machine learning is used very frequently:

Retailers: In the previous example, we mentioned how retail chains use data for machine learning to increase their revenue as well as to retain their customers.
Spam filtering: E-mails are processed using various machine learning algorithms for spam filtering.
Targeted advertisements: In our mailbox, on social sites, or in search engines, we see advertisements of our liking.

These are only some of the actual use cases that are implemented in the world today. One thing that is common between them is the user data. In the first example, retailers are using the history of transactions done by the user for targeted promotions and campaigns and for inventory management, among other things. Retail giants do this by providing users a loyalty or sign-up card. In the second example, the e-mail service provider uses trained machine learning algorithms to detect and flag spam. It does this by going through the contents of e-mails/attachments and classifying the sender of the e-mail. In the third example, again, the e-mail provider, social network, or search engine will go through our cookies, our profile, or our e-mails to do the targeted advertising.
In all of these examples, it is mentioned in the terms and conditions of the agreement when we sign up with the retailer, e-mail provider, or social network that the user's data will be used but privacy will not be violated. It is really important that, before using data that is not publicly available, we take the required permissions. Also, our machine learning models shouldn't discriminate on the basis of region, race, sex, or any other such attribute. The data provided should not be used for purposes not mentioned in the agreement or for purposes that are illegal in the region or country in question.

Machine learning – the process
Machine learning algorithms are trained in keeping with the idea of how the human brain works; they are somewhat similar. Let's discuss the whole process. The machine learning process can be described in three steps:

Input
Abstraction
Generalization

These three steps are the core of how a machine learning algorithm works. Although the algorithm may or may not be divided or represented in such a way, this explains the overall approach. The first step concentrates on what data should be there and what shouldn't. On the basis of that, the data is gathered, stored, and cleaned as per the requirements. The second step involves translating the data so that it represents the bigger class of data. This is required because we cannot capture everything, and our algorithm should not be applicable only to the data that we have. The third step focuses on the creation of a model, or an action, that will use this abstracted data and will be applicable to the broader class of data. So, what should be the flow of approaching a machine learning problem? The data goes through the abstraction process before it can be used to create the machine learning algorithm. This process itself is cumbersome. Then comes the training of the model, which is fitting the model to the dataset that we have. The computer does not pick up the model on its own; it is dependent on the learning task. The learning task also includes generalizing the knowledge gained to data that we don't have yet. Therefore, training the model happens on the data that we currently have, and the learning task includes generalization of the model for future data. How knowledge is deduced from the dataset that we currently have depends on our model. We need to build a model that can gather insights into something that wasn't known to us before, and that is useful and can be linked to future data.

Different types of machine learning
Machine learning is divided mainly into three categories:

Supervised learning
Unsupervised learning
Reinforcement learning

In supervised learning, the model/machine is presented with inputs and the outputs corresponding to those inputs. The machine learns from these inputs and applies this learning to further unseen data to generate outputs. Unsupervised learning doesn't have the required outputs; therefore it is up to the machine to learn and find patterns that were previously unseen. In reinforcement learning, the machine continuously interacts with the environment and learns through this process. This includes a feedback loop.

Understanding decision trees
A decision tree is a very good example of divide and conquer. It is one of the most practical and widely used methods for inductive inference. It is a supervised learning method that can be used for both classification and regression.
Understanding decision trees

A decision tree is a very good example of divide and conquer. It is one of the most practical and widely used methods for inductive inference. It is a supervised learning method that can be used for both classification and regression. It is non-parametric, and its aim is to learn simple decision rules inferred from the data and to build a model that can predict the value of the target variable.

Before taking a decision, we weigh the pros and cons of the different options that we have. Let's say we want to purchase a phone and we have multiple choices in our price segment. Each of the phones has something really good, maybe better than the others. To make a choice, we start with the most important feature that we want, and in this way we create a series of features that a phone has to pass to become the final choice.

In this section, we will learn about:

Decision trees
Entropy measures
Random forests

We will also learn about famous decision tree learning algorithms such as ID3 and C5.0.

Decision tree learning algorithms

There are various decision tree learning algorithms that are actually variations of a core algorithm: a top-down, greedy search through the space of possible trees. We are going to discuss two families of algorithms:

ID3
C4.5 and C5.0

The first algorithm, Iterative Dichotomiser 3 (ID3), was developed by Ross Quinlan in 1986. The algorithm proceeds by creating a multiway tree, where at each node it greedily searches for the feature that yields the maximum information gain for the categorical targets. As trees can grow to their maximum size, which can result in over-fitting of the data, pruning is used to obtain a more generalized model.

C4.5 came after ID3 and eliminated the restriction that all features must be categorical. It does this by dynamically defining a discrete attribute based on the numerical variables, partitioning the continuous attribute values into a discrete set of intervals. C4.5 converts the trained trees of the ID3 algorithm into sets of if-then rules. C5.0 is the latest version; it builds smaller rule sets and uses comparatively less memory.

An example

Let's apply what we've learned to create a decision tree in Julia. We will be using the example available for Python on scikit-learn.org and ScikitLearn.jl by Cedric St-Jean. We will first have to add the required packages:

julia> Pkg.update()
julia> Pkg.add("DecisionTree")
julia> Pkg.add("ScikitLearn")
julia> Pkg.add("PyPlot")

ScikitLearn.jl provides a Julia interface to the well-known scikit-learn machine learning library for Python:

julia> using ScikitLearn
julia> using DecisionTree
julia> using PyPlot

After adding the required packages, we will create the dataset that we will be using in our example:

julia> # Create a random dataset
julia> srand(100)
julia> X = sort(5 * rand(80))
julia> XX = reshape(X, 80, 1)
julia> y = sin(X)
julia> y[1:5:end] += 3 * (0.5 - rand(16))

The last line adds noise to every fifth target and returns a 16-element Array{Float64,1}.

Now we will create instances of two different models. In one model we will not limit the depth of the tree, and in the other model we will prune the decision tree on the basis of purity. We then fit both models to the dataset that we have. The first model ends up with 25 leaf nodes and a depth of 8; the second, pruned model has six leaf nodes and a depth of 4.
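The model-creation and fitting code appears only as screenshots in the original article. A hedged reconstruction, based on the ScikitLearn-style API of DecisionTree.jl and on the pruning_purity_threshold=0.05 label used in the plot below, might look like this (the exact parameters in the original screenshots may differ):

julia> # Hypothetical reconstruction of the omitted model-creation step
julia> regr_1 = DecisionTreeRegressor()
julia> regr_2 = DecisionTreeRegressor(pruning_purity_threshold=0.05)
julia> fit!(regr_1, XX, y)
julia> fit!(regr_2, XX, y)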
Now we will use the models to predict on the test dataset:

julia> # Predict
julia> X_test = 0:0.01:5.0
julia> y_1 = predict(regr_1, hcat(X_test))
julia> y_2 = predict(regr_2, hcat(X_test))

This creates a 501-element Array{Float64,1} for each model. To better understand the results, let's plot both models on the dataset that we have:

julia> # Plot the results
julia> scatter(X, y, c="k", label="data")
julia> plot(X_test, y_1, c="g", label="no pruning", linewidth=2)
julia> plot(X_test, y_2, c="r", label="pruning_purity_threshold=0.05", linewidth=2)
julia> xlabel("data")
julia> ylabel("target")
julia> title("Decision Tree Regression")
julia> legend(prop=Dict("size"=>10))

Decision trees tend to overfit the data, so pruning is required to make the model more generalized. However, pruning more than necessary may lead to an incorrect model, so we need to find the most suitable pruning level. Here it is quite evident that the first decision tree overfits our dataset, whereas the second, pruned decision tree is comparatively more generalized.

Summary

In this article, we learned about machine learning and its uses. Providing computers with the ability to learn and improve has far-reaching uses in this world: predicting disease outbreaks, predicting the weather, games, robots, self-driving cars, personal assistants, and a lot more. There are three different types of machine learning: supervised learning, unsupervised learning, and reinforcement learning. We also learned about decision trees.

Resources for Article:

Further resources on this subject:
Specialized Machine Learning Topics [article]
Basics of Programming in Julia [article]
More about Julia [article]

Parallel Computing

Packt
30 Sep 2016
9 min read
In this article, written by Jalem Raj Rohit, the author of the book Julia Cookbook, we cover the following recipes:

Basic concepts of parallel computing
Data movement
Parallel map and loop operations
Channels

(For more resources related to this topic, see here.)

Introduction

In this article, you will learn about performing parallel computing and using it to handle big data. Concepts like data movement, sharded arrays, and the map-reduce framework are important to know in order to handle large amounts of data by computing on it with parallelized CPUs. All the concepts discussed in this article will help you build good parallel computing and multiprocessing basics, including efficient data handling and code optimization.

Basic concepts of parallel computing

Parallel computing is a way of dealing with data in a parallel way. It can be done by connecting multiple computers as a cluster and using their CPUs to carry out the computations. This style of computation is used when handling large amounts of data and also when running complex algorithms over significantly large data. The computations are executed faster due to the availability of multiple CPUs running them in parallel, as well as the direct availability of RAM to each of them.

Getting ready

Julia has in-built support for parallel computing and multiprocessing, so these computations rarely require any external libraries.

How to do it…

Julia can be started on your local computer using multiple cores of your CPU, so that we have multiple workers for the process. This is how you can fire up Julia in the multiprocessing mode in your terminal:

julia -p 2

This creates two worker processes on the machine, which means it uses two CPU cores. The startup output will differ slightly for different operating systems and different machines.

Now, we will look at the remotecall() function. It takes multiple arguments, the first one being the process to which we want to assign the task. The next argument is the function that we want to execute, and the subsequent arguments are the parameters of that function. In this example, we will create a 2 x 2 random matrix and assign the task to process number 2. This can be done as follows:

task = remotecall(2, rand, 2, 2)

The preceding command returns a remote reference to the task.

Now that the remotecall() function for remote referencing has been executed, we will fetch the result of the function through the fetch() function. This can be done as follows:

fetch(task)

The preceding command returns the 2 x 2 matrix that was generated on worker 2.

Now, to perform some mathematical operations on the generated matrix, we can use the @spawnat macro, which takes the process number and the expression to execute. The @spawnat macro wraps the expression 5 .+ fetch(task) into an anonymous function and runs it on the second worker. This can be done as follows:

task2 = @spawnat 2 5 .+ fetch(task)

There is also a function that eliminates the need to use the two different functions remotecall() and fetch(). The remotecall_fetch() function takes multiple arguments: the first one is the process that the task is assigned to, the next argument is the function to be executed, and the subsequent arguments are the arguments of that function. Now, we will use the remotecall_fetch() function to fetch an element of the task2 matrix at a particular index. This can be done as follows:

remotecall_fetch(2, getindex, task2, 1, 1)
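Putting the calls of this recipe together, here is a minimal end-to-end sketch. It assumes Julia was started with julia -p 2 and follows the same pre-0.6 syntax used throughout the article:

# start Julia with: julia -p 2
task = remotecall(2, rand, 2, 2)                    # run rand(2, 2) on worker 2
println(fetch(task))                                # pull the 2 x 2 result back
task2 = @spawnat 2 5 .+ fetch(task)                 # do further work on worker 2
println(fetch(task2))
elem = remotecall_fetch(2, getindex, task2, 1, 1)   # call and fetch in one step
println(elem)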
How it works…

Julia can be started in the multiprocessing mode by specifying the number of processes needed while starting up the REPL. In this example, we started Julia in a two-process mode. The maximum number of processes depends on the number of cores available in the CPU.

The remotecall() function helps in selecting a particular process from the running processes in order to run a function or, in fact, any computation for us.

The fetch() function is used to fetch the results of the remotecall() function from a common data resource (or the process) for all the running processes. The details of the data source will be covered in the later sections.

The results of the fetch() function can also be used for further computations, which can be carried out with the @spawnat macro along with the results of fetch(). This assigns a process for the computation.

The remotecall_fetch() function further eliminates the need for the fetch() function in the case of a direct execution. It has both the remotecall() and fetch() operations built into it, so it acts as a combination of the second and third points in this section.

Data movement

In parallel computing, data movements are quite common, and they are also something to be minimized because of the time and network overhead they cause. In this recipe, we will see how to optimize this and avoid latency as much as we can.

Getting ready

To get ready for this recipe, you need to have the Julia REPL started in the multiprocessing mode. This is explained in the preceding recipe.

How to do it…

Firstly, we will see how to do a matrix computation using the @spawn macro, which helps in data movement. We construct a matrix of shape 200 x 200 and then square it using the @spawn macro. This can be done as follows:

mat = rand(200, 200)
exec_mat = @spawn mat^2
fetch(exec_mat)

The preceding commands return the squared 200 x 200 matrix.

Now, we will look at another way to achieve the same result. This time, we will use the @spawn macro directly, without the separate initialization step. We will discuss the advantages and drawbacks of each method in the How it works… section. This can be done as follows:

mat = @spawn rand(200, 200)^2
fetch(mat)

Fetching again returns the squared matrix, this time computed entirely on the worker.

How it works…

In this example, we constructed a 200 x 200 matrix and then used the @spawn macro to spawn a process in the CPU to execute the squaring for us. The @spawn macro picks one of the two running worker processes and uses it for the computation. In the second example, you learned how to use the @spawn macro directly, without an extra initialization part. The difference lies in data movement: in the first version, the matrix is built on the master process and has to be copied to the worker that squares it, whereas in the second version the random matrix is constructed on the worker itself, so no input data needs to be moved. The fetch() function helps us fetch the results from a common data resource of the processes. More on this will be covered in the following recipes.
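To get a feel for the cost of that data movement, a small, hedged experiment (not part of the original recipe; timings vary by machine and Julia version) is to time both variants:

# Variant 1: the matrix is built on the master and then shipped to a worker
mat = rand(200, 200)
@time fetch(@spawn mat^2)               # includes the cost of moving `mat`

# Variant 2: the matrix is built and squared on the worker itself
@time fetch(@spawn rand(200, 200)^2)    # no input matrix is moved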
Parallel maps and loop operations

In this recipe, you will learn a bit about the famous Map-Reduce framework and why it is one of the most important ideas in the domains of big data and parallel computing. You will learn how to parallelize loops and use reducing functions on them across several CPUs and machines, building on the concepts of parallel computing you learned about in the previous recipes.

Getting ready

Just like the previous sections, Julia only needs to be running in the multiprocessing mode to follow the examples. This can be done through the instructions given in the first section.

How to do it…

Firstly, we will write a function that flips n random bits (coin tosses) and adds them up. The writing of this function has nothing to do with multiprocessing, so it uses simple Julia functions and loops. This function can be written as shown in the count_heads() listing at the end of this article.

Now, we will use the @spawn macro, which we learned about previously, to run the count_heads() function on separate processes. The count_heads() function needs to be in the same directory for this to work. This can be done as follows:

require("count_heads")

a = @spawn count_heads(100)
b = @spawn count_heads(100)
fetch(a) + fetch(b)

However, we can also parallelize the loop directly and take the sum. The parallelizing part is called mapping, and the addition of the parallelized bits is called reduction. Together, the process constitutes the famous Map-Reduce framework. This is made possible using the @parallel macro, as follows:

nheads = @parallel (+) for i = 1:200
    Int(rand(Bool))
end

How it works…

The first function is a simple Julia function that adds random bits with every loop iteration. It was created just for the demonstration of Map-Reduce operations.

In the second point, we spawn two separate processes to execute the function and then fetch the results of both of them and add them up. However, that is not really a neat way to carry out parallel computation of functions and loops. Instead, the @parallel macro provides a better way to do it: it allows the user to parallelize the loop and then reduce the computations through an operator, which together constitutes the Map-Reduce operation.

Channels

Channels are like the background plumbing for parallel computing in Julia. They are the reservoirs from which the individual processes access their data.

Getting ready

The requisites are similar to the previous sections. This is mostly a theoretical section, so you just need to run your experiments on your own. For that, you need to run your Julia REPL in the multiprocessing mode.

How to do it…

Channels are shared queues with a fixed length. They are common data reservoirs for the processes that are running, and multiple readers or workers can access the same channel as a common data resource. The workers can read data from the channel through the fetch() function, which we already discussed in the previous sections. They can also write to the channel through the put!() function, which means the workers can add more data to the resource, and that data can then be accessed by all the workers running a particular computation. Closing a channel after usage is a good practice to avoid data corruption and unnecessary memory usage; it can be done using the close() function. A short illustrative sketch of these functions is given at the end of this article.

Summary

In this article, we covered the basic concepts of parallel computing and the data movement that takes place in the network. We also learned about parallel maps and loop operations, along with the famous Map-Reduce framework. At the end, we got a brief understanding of channels and how individual processes access their data from them.

Resources for Article:

Further resources on this subject:
More about Julia [article]
Basics of Programming in Julia [article]
Simplifying Parallelism Complexity in C# [article]
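For reference, here is a minimal reconstruction of the count_heads() function used in the Parallel maps recipe above. It is written to match how the function is called (count_heads(100)) and is assumed to live in a file named count_heads.jl so that require("count_heads") can load it; the original listing may differ in its details:

# count_heads.jl -- assumed filename, so that require("count_heads") works
function count_heads(n)
    c = 0
    for i = 1:n
        c += Int(rand(Bool))   # add 1 for every "head"
    end
    c
end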
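The Channels recipe above describes put!(), fetch(), and close() without showing code. The following short sketch is illustrative only; it assumes a Julia version where the Channel type is available with this constructor, and it is not taken from the original article:

c = Channel{Int}(5)      # a shared queue that holds at most 5 integers
put!(c, 10)              # a producer adds data to the reservoir
put!(c, 20)
println(fetch(c))        # reads the first value without removing it -> 10
println(take!(c))        # removes and returns the first value -> 10
println(take!(c))        # -> 20
close(c)                 # good practice once the channel is no longer needed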


Deep Learning with Torch

Preetham Sreenivas
29 Sep 2016
10 min read
Torch is a scientific computing framework built on top of Lua[JIT]. The nn package and the ecosystem around it provide a very powerful framework for building deep learning models, striking a perfect balance between speed and flexibility. It is used at Facebook AI Research (FAIR), Twitter Cortex, DeepMind, Yann LeCun's group at NYU, Fei-Fei Li's at Stanford, and many more industrial and academic labs. If you are like me, and don't like writing equations for backpropagation every time you want to try a simple model, Torch is a great solution. With Torch, you can also do pretty much anything you can imagine, whether that is writing custom loss functions, dreaming up an arbitrary acyclic graph network, using multiple GPUs, or loading pre-trained ImageNet models from the caffe model-zoo (yes, you can load models trained in caffe with a single line). Without further ado, let's jump right into the awesome world of deep learning.

Prerequisites

Some knowledge of deep learning (A Primer, Bengio's deep learning book, Hinton's Coursera course).
A bit of Lua. Its syntax is very C-like and can be picked up fairly quickly if you know Python or JavaScript (Learn Lua in 15 minutes, Torch For Numpy Users).
A machine with Torch installed, since this is intended to be hands-on.

On Ubuntu 12+ and Mac OS X, installing Torch looks like this:

# in a terminal, run the commands WITHOUT sudo
$ git clone https://github.com/torch/distro.git ~/torch --recursive
$ cd ~/torch; bash install-deps;
$ ./install.sh

# On Linux with bash
$ source ~/.bashrc
# On OSX or in Linux with no bash.
$ source ~/.profile

Once you've installed Torch, you can run a Torch script using:

$ th script.lua

# alternatively you can fire up a terminal torch interpreter using th -i
$ th -i
# and run multiple scripts one by one, the variables will be accessible to other scripts
> dofile 'script1.lua'
> dofile 'script2.lua'
> print(variable) -- variable from either of these scripts.

The sections below are very code intensive, but you can run these commands from Torch's terminal interpreter:

$ th -i

Building a Model: The Basics

A module is the basic building block of any Torch model. It has forward and backward methods for the forward and backward passes of backpropagation. You can combine modules using containers, and of course, calling forward and backward on containers propagates inputs and gradients correctly.

-- A simple mlp model with sigmoids
require 'nn'
linear1 = nn.Linear(100,10) -- A linear layer Module
linear2 = nn.Linear(10,2)
-- You can combine modules using containers, sequential is the most used one
model = nn.Sequential() -- A container
model:add(linear1)
model:add(nn.Sigmoid())
model:add(linear2)
model:add(nn.Sigmoid())

-- the forward step
input = torch.rand(100)
target = torch.rand(2)
output = model:forward(input)

Now we need a criterion to measure how well our model is performing, in other words, a loss function. nn.Criterion is the abstract class that all loss functions inherit. It provides forward and backward methods, computing the loss and the gradients respectively. Torch provides most of the commonly used criterions out of the box, and it isn't much of an effort to write your own either.
criterion = nn.MSECriterion() -- mean squared error criterion
loss = criterion:forward(output, target)
gradientsAtOutput = criterion:backward(output, target)
-- To perform the backprop step, we need to pass these gradients to the backward
-- method of the model
gradAtInput = model:backward(input, gradientsAtOutput)

lr = 0.1 -- learning rate for our model
model:updateParameters(lr) -- updates the parameters using the lr parameter.

The updateParameters method simply subtracts the gradients, scaled by the learning rate, from the model parameters. This is vanilla stochastic gradient descent. Typically, the updates we do are more complex. For example, if we want to use momentum, we need to keep track of the updates we did in the previous epoch. There are many fancier optimization schemes, such as RMSProp, adam, adagrad, and L-BFGS, that do more complex things like adapting the learning rate and the momentum factor. The optim package provides these optimization routines out of the box.

Dataset

We'll use the German Traffic Sign Recognition Benchmark (GTSRB) dataset. This dataset has 43 classes of traffic signs of varying sizes, illuminations, and occlusions. There are 39,000 training images and 12,000 test images. The traffic signs are not centered in the images, and they have a 10% border around them. I have included a shell script for downloading the data along with the code for this tutorial in this github repo.[1] The code in the repo is much more polished than the snippets in this tutorial; it is modular and allows you to change the model and/or datasets easily.

git clone https://github.com/preethamsp/tutorial.gtsrb.torch.git
cd tutorial.gtsrb.torch/datasets
bash download_gtsrb.sh

Model

Let's build a downsized VGG-style model with what we've learned.

function createModel()
   require 'nn'
   nbClasses = 43

   local net = nn.Sequential()

   --[[building block: adds a convolution layer, batch norm layer
       and a relu activation to the net]]--
   function ConvBNReLU(nInputPlane, nOutputPlane)
      -- kernel size = (3,3), stride = (1,1), padding = (1,1)
      net:add(nn.SpatialConvolution(nInputPlane, nOutputPlane, 3,3, 1,1, 1,1))
      net:add(nn.SpatialBatchNormalization(nOutputPlane,1e-3))
      net:add(nn.ReLU(true))
   end

   ConvBNReLU(3,32)
   ConvBNReLU(32,32)
   net:add(nn.SpatialMaxPooling(2,2,2,2))
   net:add(nn.Dropout(0.2))

   ConvBNReLU(32,64)
   ConvBNReLU(64,64)
   net:add(nn.SpatialMaxPooling(2,2,2,2))
   net:add(nn.Dropout(0.2))

   ConvBNReLU(64,128)
   ConvBNReLU(128,128)
   net:add(nn.SpatialMaxPooling(2,2,2,2))
   net:add(nn.Dropout(0.2))

   net:add(nn.View(128*6*6))
   net:add(nn.Dropout(0.5))
   net:add(nn.Linear(128*6*6,512))
   net:add(nn.BatchNormalization(512))
   net:add(nn.ReLU(true))
   net:add(nn.Linear(512,nbClasses))
   net:add(nn.LogSoftMax())

   return net
end

The first layer contains three input channels because we're going to pass RGB images (three channels). For grayscale images, the first layer has one input channel. I encourage you to play around and modify the network.[2]

There are a bunch of new modules that need some elaboration. The Dropout module randomly deactivates a neuron with some probability. It is known to help generalization by preventing co-adaptation between neurons; that is, a neuron should now depend less on its peers, forcing it to learn a bit more. BatchNormalization is a very recent development. It is known to speed up convergence by normalizing the outputs of a layer to a unit Gaussian using the statistics of a batch.

Let's use this model and train it. In the interest of brevity, I'll use these constructs directly.
The code describing these constructs is in datasets/gtsrb.lua. DataGen:trainGenerator(batchSize) DataGen:valGenerator(batchSize) These provide iterators over batches of train and test data respectively. You'll find that the model code (models/vgg_small.lua) in the repo is different. It is designed to allow you to experiment quickly. Using optim to train the model Using a stochastic gradient descent (sgd) from the optim package to minimize a function f looks like this: optim.sgd(feval, params, optimState) Where: feval: A user-defined function that respects the API: f, df/params = feval(params) params: The current parameter vector (a 1D torch.Tensor) optimState: A table of parameters, and state variables, dependent upon the algorithm Since we are optimizing the loss of the neural network, parameters should be the weights and other parameters of the network. We get these as a flattened 1D tensor using model:getParameters. It also returns a tensor containing the gradients of these parameters. This is useful in creating the feval function above. model = createModel() criterion = nn.ClassNLLCriterion() -- criterion we are optimizing: negative log loss params, gradParams = model:getParameters() local function feval() -- criterion.output stores the latest output of criterion return criterion.output, gradParams end We need to create an optimState table and initialize it with a configuration of our optimizer like learning rate and momentum: optimState = { learningRate = 0.01, momentum = 0.9, dampening = 0.0, nesterov = true, } Now, an update to the model should do the following: Compute the output of the model using model:forward(). Compute the loss and the gradients at output layer using criterion:forward() and criterion:backward() respectively. Update the gradients of the model parameters using model:backward(). Update the model using optim.sgd. -- Forward pass output = model:forward(input) loss = criterion:forward(output, target) -- Backward pass critGrad = criterion:backward(output, target) model:backward(input, critGrad) -- Updates optim.sgd(feval, params, optimState) Note: The order above should be respected, as backward assumes forward was run just before it. Changing this order might result in gradients not being computed correctly. Putting it all together Let's put it all together and write a function that trains the model for an epoch. We'll create a loop that iterates over the train data in batches and updates the model. 
model = createModel() criterion = nn.ClassNLLCriterion() dataGen = DataGen('datasets/GTSRB/') -- Data generator params, gradParams = model:getParameters() batchSize = 32 optimState = { learningRate = 0.01, momentum = 0.9, dampening = 0.0, nesterov = true, } function train() -- Dropout and BN behave differently during training and testing -- So, switch to training mode model:training() local function feval() return criterion.output, gradParams end for input, target in dataGen:trainGenerator(batchSize) do -- Forward pass local output = model:forward(input) local loss = criterion:forward(output, target) -- Backward pass model:zeroGradParameters() -- clear grads from previous update local critGrad = criterion:backward(output, target) model:backward(input, critGrad) -- Updates optim.sgd(feval, params, optimState) end end The test function is extremely similar, except that we don't need to update the parameters: confusion = optim.ConfusionMatrix(nbClasses) -- to calculate accuracies function test() model:evaluate() -- switch to evaluate mode confusion:zero() -- clear confusion matrix for input, target in dataGen:valGenerator(batchSize) do local output = model:forward(input) confusion:batchAdd(output, target) end confusion:updateValids() local test_acc = confusion.totalValid * 100 print(('Test accuracy: %.2f'):format(test_acc)) end Now that everything is set, you can train your network and print the test accuracies: max_epoch = 20 for i = 1,20 do train() test() end An epoch takes around 30 seconds on a TitanX and gives about 97.7% accuracy after 20 epochs. This is a very basic model and honestly I haven't tried optimizing the parameters much. There are a lot of things that can be done to crank up the accuracies. Try different processing procedures. Experiment with the net structure. Different weight initializations, and learning rate schedules. An Ensemble of different models; for example, train multiple models and take a majority vote. You can have a look at the state of the art on this dataset here. They achieve upwards of 99.5% accuracy using a clever method to boost the geometric variation of CNNs. Conclusion We looked at how to build a basic mlp in Torch. We then moved on to building a Convolutional Neural Network and trained it to solve a real-world problem of traffic sign recognition. For a beginner, Torch/LUA might not be as easy. But once you get a hang of it, you have access to a deep learning framework which is very flexible yet fast. You will be able to easily reproduce latest research or try new stuff unlike in rigid frameworks like keras or nolearn. I encourage you to give it a fair try if you are going anywhere near deep learning. Resources Torch Cheat Sheet Awesome Torch Torch Blog Facebook's Resnet Code Oxford's ML Course Practicals Learn torch from Github repos About the author Preetham Sreenivas is a data scientist at Fractal Analytics. Prior to that, he was a software engineer at Directi.