How-To Tutorials - Data


Getting to know TensorFlow

Kartikey Pandey
29 Nov 2017
7 min read
[box type="note" align="" class="" width=""] The following book excerpt is from the title Machine Learning Algorithms by Guiseppe Bonaccorso. The book describes important Machine Learning algorithms commonly used in the field of data science. These algorithms can be used for supervised as well as unsupervised learning, reinforcement learning, and semi-supervised learning. Few famous ones covered in the book are Linear regression, Logistic Regression, SVM, Naive Bayes, K-Means, Random Forest, TensorFlow, and Feature engineering. [/box] Here, in the article, we look at understanding most important Deep learning library-Tensorflow with contextual examples. Brief Introduction to TensorFlow TensorFlow is a computational framework created by Google and has become one of the most diffused deep-learning toolkits. It can work with both CPUs and GPUs and already implements most of the operations and structures required to build and train a complex model. TensorFlow can be installed as a Python package on Linux, Mac, and Windows (with or without GPU support); however, we suggest you follow the instructions provided on the website to avoid common mistakes. The main concept behind TensorFlow is the computational graph or a set of subsequent operations that transform an input batch into the desired output. In the following figure, there's a schematic representation of a graph: Starting from the bottom, we have two input nodes (a and b), a transpose operation (that works on b), a matrix multiplication and a mean reduction. The init block is a separate operation, which is formally part of the graph, but it's not directly connected to any other node; therefore it's autonomous (indeed, it's a global initializer). As this one is only a brief introduction, it's useful to list all of the most important strategic elements needed to work with TensorFlow so as to be able to build a few simple examples that can show the enormous potential of this framework: Graph: This represents the computational structure that connects a generic input batch with the output tensors through a directed network made of operations. It's defined as a tf.Graph() instance and normally used with a Python context Manager. Placeholder: This is a reference to an external variable, which must be explicitly supplied when it's requested for the output of an operation that uses it directly or indirectly. For example, a placeholder can represent a variable x, which is first transformed into its squared value and then summed to a constant value. The output is thenx2+c, which is materialized by passing a concrete value for x. It's defined as a tf.placeholder() instance. Variable: An internal variable used to store values which are updated by the algorithm. For example, a variable can be a vector containing the weights of a logistic regression. It's normally initialized before a training process and automatically modified by the built-in optimizers. It's defined as a tf.Variable() instance. A variable can also be used to store elements which must not be considered during training processes; in this case, it must be declared with the parameter trainable=False Constant: A constant value defined as a tf.constant() instance. Operation: A mathematical operation that can work with placeholders, variables, and constants. For example, the multiplication of two matrices is an operation defined a tf.constant(). Among all operations, gradient calculation is one of the most important. 
TensorFlow allows us to determine the gradients starting from a given point in the computational graph, back to the origin or to another point that must logically precede it. We're going to see an example of this operation.

- Session: This is a sort of wrapper-interface between TensorFlow and our working environment (for example, Python or C++). When the evaluation of a graph is needed, this macro-operation will be managed by a session, which must be fed with all placeholder values and will produce the required outputs using the requested devices. For our purposes, it's not necessary to go deeper into this concept; however, I invite the reader to retrieve further information from the website or from one of the resources listed at the end of this chapter. It's declared as an instance of tf.Session() or, as we're going to do, an instance of tf.InteractiveSession(). This type of session is particularly useful when working with notebooks or shell commands, because it places itself automatically as the default one.
- Device: A physical computational device, such as a CPU or a GPU. It's declared explicitly through an instance of the class tf.device() and used with a context manager. When the architecture contains more computational devices, it's possible to split the jobs so as to parallelize many operations. If no device is specified, TensorFlow will use the default one (which is the main CPU or a suitable GPU if all the necessary components are installed).

Let's now analyze these concepts with a simple example about computing gradients.

Computing gradients

The option to compute the gradients of all output tensors with respect to any connected input or node is one of the most interesting features of TensorFlow, because it allows us to create learning algorithms without worrying about the complexity of all transformations. In this example, we first define a linear dataset representing the function f(x) = x in the range (-100, 100):

import numpy as np

>>> nb_points = 100
>>> X = np.linspace(-nb_points, nb_points, 200, dtype=np.float32)

The corresponding plot is shown in the following figure. Now we want to use TensorFlow to compute y = x^3 together with its first and second derivatives. The first step is defining a graph:

import tensorflow as tf

>>> graph = tf.Graph()

Within the context of this graph, we can define our input placeholder and other operations:

>>> with graph.as_default():
>>>    Xt = tf.placeholder(tf.float32, shape=(None, 1), name='x')
>>>    Y = tf.pow(Xt, 3.0, name='x_3')
>>>    Yd = tf.gradients(Y, Xt, name='dx')
>>>    Yd2 = tf.gradients(Yd, Xt, name='d2x')

A placeholder is generally defined with a type (first parameter), a shape, and an optional name. We've decided to use a tf.float32 type because this is the only type also supported by GPUs. Selecting shape=(None, 1) means that it's possible to use any bidimensional vector with the second dimension equal to 1. The first operation computes the third power of Xt, working on all elements. The second operation computes all the gradients of Y with respect to the input placeholder Xt. The last operation will repeat the gradient computation, but in this case, it uses Yd, which is the output of the first gradient operation.

We can now pass some concrete data to see the results. The first thing to do is create a session connected to this graph:

>>> session = tf.InteractiveSession(graph=graph)

Using this session, we can ask for any computation using the method run().
All the input parameters must be supplied through a feed dictionary, where the key is the placeholder and the value is the actual array:

>>> X2, dX, d2X = session.run([Y, Yd, Yd2], feed_dict={Xt: X.reshape((nb_points*2, 1))})

We needed to reshape our array to be compliant with the placeholder. The first argument of run() is a list of tensors that we want to be computed. In this case, we need all operation outputs. The plot of each of them is shown in the following figure: as expected, they represent, respectively, x^3, 3x^2, and 6x.

Further in the book, a slightly more complex example shows how to implement a logistic regression algorithm. Refer to Chapter 14, Brief Introduction to Deep Learning and TensorFlow, of Machine Learning Algorithms to read the complete chapter.
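The snippets above follow the book's interactive >>> style. For readers who prefer to run the example end to end, here is a minimal consolidated sketch of the same computation; the plotting step at the bottom is our own addition, and the script assumes a TensorFlow 1.x style installation where tf.placeholder and tf.Session are available (in TensorFlow 2.x the same calls exist under tf.compat.v1 with eager execution disabled).

```python
# Minimal consolidated sketch of the gradient example above (TensorFlow 1.x API).
import numpy as np
import tensorflow as tf
import matplotlib.pyplot as plt

nb_points = 100
X = np.linspace(-nb_points, nb_points, 200, dtype=np.float32)

graph = tf.Graph()
with graph.as_default():
    Xt = tf.placeholder(tf.float32, shape=(None, 1), name='x')
    Y = tf.pow(Xt, 3.0, name='x_3')          # y = x^3
    Yd = tf.gradients(Y, Xt, name='dx')      # dy/dx = 3x^2
    Yd2 = tf.gradients(Yd, Xt, name='d2x')   # d2y/dx2 = 6x

with tf.Session(graph=graph) as session:
    X2, dX, d2X = session.run([Y, Yd, Yd2],
                              feed_dict={Xt: X.reshape((nb_points * 2, 1))})

# Plot the three curves; tf.gradients returns a list, hence the [0] indexing.
plt.plot(X, X2.ravel(), label='x^3')
plt.plot(X, dX[0].ravel(), label='3x^2')
plt.plot(X, d2X[0].ravel(), label='6x')
plt.legend()
plt.show()
```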

Building a classification system with logistic regression in OpenCV

Savia Lobo
28 Nov 2017
7 min read
[box type="note" align="" class="" width=""]This article is an excerpt from a book by Michael Beyeler titled Machine Learning for OpenCV. The code and related files are available on Github here.[/box] A famous dataset in the world of machine learning is called the Iris dataset. The Iris dataset contains measurements of 150 iris flowers from three different species: setosa, versicolor, and viriginica. These measurements include the length and width of the petals, and the length and width of the sepals, all measured in centimeters: Understanding logistic regression Despite its name, logistic regression can actually be used as a model for classification. It uses a logistic function (or sigmoid) to convert any real-valued input x into a predicted output value ŷ that take values between 0 and 1, as shown in the following figure:         The logistic function Rounding ŷ to the nearest integer effectively classifies the input as belonging either to class 0 or 1. Of course, most often, our problems have more than one input or feature value, x. For example, the Iris dataset provides a total of four features. For the sake of simplicity, let's focus here on the first two features, sepal length—which we will call feature f1—and sepal width—which we will call f2. Using the tricks we learned when talking about linear regression, we know we can express the input x as a linear combination of the two features, f1 and f2: However, in contrast to linear regression, we are not done yet. From the previous section, we know that the sum of products would result in a real-valued, output—but we are interested in a categorical value, zero or one. This is where the logistic function comes in: it acts as a squashing function, σ, that compresses the range of possible output values to the range [0, 1]: [box type="shadow" align="" class="" width=""]Because the output is always between 0 and 1, it can be interpreted as a probability. If we only have a single input variable x, the output value ŷ can be interpreted as the probability of x belonging to class 1.[/box] Now let's apply this knowledge to the Iris dataset! Loading the training data The Iris dataset is included with scikit-learn. We first load all the necessary modules, as we did in our earlier examples: In [1]: import numpy as np ... import cv2 ... from sklearn import datasets ... from sklearn import model_selection ... from sklearn import metrics ... import matplotlib.pyplot as plt ... %matplotlib inline In [2]: plt.style.use('ggplot') Then, loading the dataset is a one-liner: In [3]: iris = datasets.load_iris() This function returns a dictionary we call iris, which contains a bunch of different fields: In [4]: dir(iris) Out[4]: ['DESCR', 'data', 'feature_names', 'target', 'target_names'] Here, all the data points are contained in 'data'. There are 150 data points, each of which has four feature values: In [5]: iris.data.shape Out[5]: (150, 4) These four features correspond to the sepal and petal dimensions mentioned earlier: In [6]: iris.feature_names Out[6]: ['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)'] For every data point, we have a class label stored in target: In [7]: iris.target.shape Out[7]: (150,) We can also inspect the class labels, and find that there is a total of three classes: In [8]: np.unique(iris.target) Out[8]: array([0, 1, 2]) Making it a binary classification problem For the sake of simplicity, we want to focus on a binary classification problem for now, where we only have two classes. 
The easiest way to do this is to discard all data points belonging to a certain class, such as class label 2, by selecting all the rows that do not belong to class 2:

In [9]: idx = iris.target != 2
...     data = iris.data[idx].astype(np.float32)
...     target = iris.target[idx].astype(np.float32)

Inspecting the data

Before you get started with setting up a model, it is always a good idea to have a look at the data. We did this earlier for the town map example, so let's continue our streak. Using Matplotlib, we create a scatter plot where the color of each data point corresponds to the class label:

In [10]: plt.scatter(data[:, 0], data[:, 1], c=target, cmap=plt.cm.Paired, s=100)
...      plt.xlabel(iris.feature_names[0])
...      plt.ylabel(iris.feature_names[1])
Out[10]: <matplotlib.text.Text at 0x23bb5e03eb8>

To make plotting easier, we limit ourselves to the first two features (iris.feature_names[0] being the sepal length and iris.feature_names[1] being the sepal width). We can see a nice separation of classes in the resulting figure, which plots the first two features of the Iris dataset.

Splitting the data into training and test sets

We learned in the previous chapter that it is essential to keep training and test data separate. We can easily split the data using one of scikit-learn's many helper functions:

In [11]: X_train, X_test, y_train, y_test = model_selection.train_test_split(
...          data, target, test_size=0.1, random_state=42
...      )

Here we want to split the data into 90 percent training data and 10 percent test data, which we specify with test_size=0.1. By inspecting the return arguments, we note that we ended up with exactly 90 training data points and 10 test data points:

In [12]: X_train.shape, y_train.shape
Out[12]: ((90, 4), (90,))

In [13]: X_test.shape, y_test.shape
Out[13]: ((10, 4), (10,))

Training the classifier

Creating a logistic regression classifier involves pretty much the same steps as setting up k-NN:

In [14]: lr = cv2.ml.LogisticRegression_create()

We then have to specify the desired training method. Here, we can choose cv2.ml.LogisticRegression_BATCH or cv2.ml.LogisticRegression_MINI_BATCH. For now, all we need to know is that we want to update the model after every data point, which can be achieved with the following code:

In [15]: lr.setTrainMethod(cv2.ml.LogisticRegression_MINI_BATCH)
...      lr.setMiniBatchSize(1)

We also want to specify the number of iterations the algorithm should run before it terminates:

In [16]: lr.setIterations(100)

We can then call the training method of the object (in the exact same way as we did earlier), which will return True upon success:

In [17]: lr.train(X_train, cv2.ml.ROW_SAMPLE, y_train)
Out[17]: True

As we just saw, the goal of the training phase is to find a set of weights that best transform the feature values into an output label. A single data point is given by its four feature values (f0, f1, f2, f3). Since we have four features, we should also get four weights, so that x = w0 f0 + w1 f1 + w2 f2 + w3 f3, and ŷ = σ(x). However, as discussed previously, the algorithm adds an extra weight that acts as an offset or bias, so that x = w0 f0 + w1 f1 + w2 f2 + w3 f3 + w4. We can retrieve these weights as follows:

In [18]: lr.get_learnt_thetas()
Out[18]: array([[-0.04109113, -0.01968078, -0.16216497, 0.28704911, 0.11945518]], dtype=float32)

This means that the input to the logistic function is x = -0.0411 f0 - 0.0197 f1 - 0.162 f2 + 0.287 f3 + 0.119.
Then, when we feed in a new data point (f0, f1, f2, f3) that belongs to class 1, the output ŷ = σ(x) should be close to 1. But how well does that actually work?

Testing the classifier

Let's see for ourselves by calculating the accuracy score on the training set:

In [19]: ret, y_pred = lr.predict(X_train)
In [20]: metrics.accuracy_score(y_train, y_pred)
Out[20]: 1.0

Perfect score! However, this only means that the model was able to perfectly memorize the training dataset. This does not mean that the model would be able to classify a new, unseen data point. For this, we need to check the test dataset:

In [21]: ret, y_pred = lr.predict(X_test)
...      metrics.accuracy_score(y_test, y_pred)
Out[21]: 1.0

Luckily, we get another perfect score! Now we can be sure that the model we built is truly awesome.

If you enjoyed building a classifier using logistic regression and would like to learn more machine learning tasks using OpenCV, be sure to check out the book, Machine Learning for OpenCV, where this section originally appears.
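To connect the learned weights back to the formula ŷ = σ(x) discussed above, here is a small sketch that applies the thetas returned by get_learnt_thetas() by hand and prints the result next to lr.predict(). It is not part of the book's code: it assumes, as the excerpt does, that the last theta is the bias term, and the helper name predict_manually is our own.

```python
# Hypothetical follow-up: reproduce the prediction by applying sigma(x) by hand.
# Assumes lr, X_test, and np are already defined as in the excerpt above, and that
# the last entry of get_learnt_thetas() is the bias (the excerpt's interpretation).
def predict_manually(lr_model, samples):
    thetas = lr_model.get_learnt_thetas().ravel()
    weights, bias = thetas[:-1], thetas[-1]
    x = samples @ weights + bias          # x = w0*f0 + w1*f1 + w2*f2 + w3*f3 + w4
    y_hat = 1.0 / (1.0 + np.exp(-x))      # logistic (sigmoid) function
    return (y_hat > 0.5).astype(np.float32)

# Compare the manual computation with OpenCV's own predictions on the test set.
_, opencv_pred = lr.predict(X_test)
print("manual :", predict_manually(lr, X_test))
print("opencv :", opencv_pred.ravel())
```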

Highest Paying Data Science Jobs in 2017

Amey Varangaonkar
27 Nov 2017
10 min read
It is no secret that this is the age of data. More data has been created in the last 2 years than ever before. Within the dumps of data created every second, businesses are looking for useful, action-worthy insights which they can use to enhance their processes and thereby increase their revenue and profitability. As a result, the demand for data professionals, who sift through terabytes of data for accurate analysis and extract valuable business insights from it, is now higher than ever before.

Think of data science as a large tree from which all things related to data branch out, from plain old data management and analysis to Big Data, and more. Even the recently booming trends in Artificial Intelligence such as machine learning and deep learning are applied in many ways within data science. Data science continues to be a lucrative and growing job market in recent years, as evidenced by the graph below. (Source: Indeed.com)

In this article, we look at some of the high-paying, high-trending job roles in the data science domain that you should definitely look out for if you're considering data science as a serious career opportunity. Let's get started with the obvious and the most popular role.

Data Scientist

Dubbed the sexiest job of the 21st century, data scientists utilize their knowledge of statistics and programming to turn raw data into actionable insights. From identifying the right dataset to cleaning and readying the data for analysis, to gleaning insights from said analysis, data scientists communicate the results of their findings to the decision makers. They also act as advisors to executives and managers by explaining how the data affects a particular product or process within the business so that appropriate actions can be taken by them.

Per Salary.com, the median annual salary for the role of a data scientist today is $122,258, with a range between $106,529 and $137,037. The salary is also accompanied by a whole host of benefits and perks which vary from one organization to the other, making this job one of the best and the most in-demand in the job market today. This is a clear testament to the fact that an increasing number of businesses are now taking the value of data seriously, and want the best talent to help them extract that value. There are over 20,000 jobs listed for the role of data scientist, and the demand is only growing. (Source: Indeed.com)

To become a data scientist, you require a bachelor's or a master's degree in mathematics or statistics and work experience of more than 5 years in a related field. You will need to possess a unique combination of technical and analytical skills to understand the problem statement and propose the best solution, good programming skills to develop effective data models, and visualization skills to communicate your findings with the decision makers.

Interested in becoming a data scientist? Here are some resources to help you get started: Principles of Data Science, Data Science Algorithms in a Week, and Getting Started with R for Data Science [Video]. For a more comprehensive learning experience, check out our skill plan for becoming a data scientist on Mapt, our premium skills development platform.

Data Analyst

Probably a term you are quite familiar with, data analysts are responsible for crunching large amounts of data and analyzing it to come to appropriate logical conclusions.
Whether it's related to pure research or working with domain-specific data, a data analyst's job is to make the decision-makers' job easier by giving them useful insights. Effective data management, analyzing data, and reporting results are some of the common tasks associated with this role. How is this role different from that of a data scientist, you might ask. While data scientists specialize in maths, statistics, and predictive analytics for better decision making, data analysts specialize in the tools and components of data architecture for better analysis.

Per Salary.com, the median annual salary for an entry-level data analyst is $55,804, and the range usually falls between $50,063 and $63,364, excluding bonuses and benefits. For more experienced data analysts, this figure rises to around a mean annual salary of $88,532. With over 83,000 jobs listed on Indeed.com, this is one of the most popular job roles in the data science community today. This role has a relatively low barrier to entry, which is reflected in the lower starting salary packages. As you gain more experience, you can move up the ladder and look at becoming a data scientist or a data engineer. (Source: Indeed.com)

You may also come across terms such as business data analyst or simply business analyst, which are sometimes used interchangeably with the role of a data analyst. Although their primary responsibilities are centered around data crunching, business analysts model company infrastructure, while data analysts model business data structures. You can find more information related to the differences in this interesting article. If becoming a data analyst is something that interests you, here are some very good starting points: Data Analysis with R, Python Data Analysis, Second Edition, and Learning Python Data Analysis [Video].

Data Architect

Data architects are responsible for creating a solid data management blueprint for an organization. They are primarily responsible for designing the data architecture and defining how data is stored, consumed, and managed by different applications and departments within the organization. Because of these critical responsibilities, a data architect's job is a very well-paid one. Per Salary.com, the median annual salary for an entry-level data architect is $74,809, with a range between $57,964 and $91,685. For senior-level data architects, the median annual salary rises up to $136,856, with a range usually between $121,969 and $159,212. These high figures are justified by the critical nature of the role of a data architect - planning and designing the right data infrastructure after understanding the business considerations to get the most value out of the data. At present, there are over 23,000 jobs for the role listed on Indeed.com, with a stable trend in job seeker interest. (Source: Indeed.com)

To become a data architect, you need a bachelor's degree in computer science, mathematics, statistics, or a related field, and loads of real-world skills to qualify for even the entry-level positions. Technical skills such as statistical modeling, knowledge of languages such as Python and R, database architectures, Hadoop-based skills, knowledge of NoSQL databases, and some machine learning and data mining are required to become a data architect. You also need strong collaborative skills, problem-solving, creativity, and the ability to think on your feet to solve the trickiest of problems on the go. Suffice to say it's not an easy job, but it is definitely a lucrative one!
Get ahead of the curve, and start your journey to becoming a data architect now: Big Data Analytics, Hadoop Blueprints, and PostgreSQL for Data Architects.

Data Engineer

Data engineers or Big Data engineers are a crucial part of the organizational workforce and work in tandem with data architects and data scientists to ensure appropriate data management systems are deployed and the right kind of data is being used for analysis. They deal with messy, unstructured Big Data and strive to provide clean, usable data to the other teams within the organization. They build high-performance analytics pipelines and develop a set of processes for efficient data mining. In many companies, the role of a data engineer is closely associated with that of a data architect. While an architect is responsible for the planning and designing stages of the data infrastructure project, a data engineer looks after the construction, testing, and maintenance of the infrastructure. As such, data engineers tend to have a more in-depth understanding of different data tools and languages than data architects.

There are over 90,000 jobs listed on Indeed.com, suggesting there is very high demand in organizations for this kind of role. An entry-level data engineer has a median annual salary of $90,083 per Payscale.com, with a range of $60,857 to $131,851. For senior data engineers, the average salary shoots up to $123,749 as per Glassdoor estimates. (Source: Indeed.com)

With the unimaginable rise in the sheer volume of data, the onus is on the data engineers to build the right systems that empower the data analysts and data scientists to sift through the messy data and derive actionable insights from it. If becoming a data engineer is something that interests you, here are some of our products you might want to look at: Real-Time Big Data Analytics, Apache Spark 2.x Cookbook, and Big Data Visualization. You can also check out our detailed skill plan on becoming a Big Data Engineer on Mapt.

Chief Data Officer

Countless organizations build their businesses on data but don't manage it that well. This is where a senior executive popularly known as the Chief Data Officer (CDO) comes into play - bearing the responsibility for implementing the organization's data and information governance and assisting with data-driven business strategies. They are primarily responsible for ensuring that their organization gets the most value out of their data and put appropriate plans in place for effective data quality and its life-cycle management.

The role of a CDO is one of the most lucrative and highest-paying jobs on the data science frontier. The median annual pay for a CDO, per Payscale.com, is around $192,812. Indeed.com lists just over 8000 job postings too - this is not a very large number, but understandable considering the recent emergence of the role and because it's a high-profile, C-suite job. (Source: Indeed.com)

According to Gartner research, almost 50% of companies in a variety of regulated industries will have a CDO in place by 2017. Considering the demand for the role and the fact that it is only going to rise in the future, the role of a CDO is one worth vying for. To become a CDO, you will obviously need a solid understanding of statistical, mathematical, and analytical concepts. Not just that, extensive and high-value experience in managing technical teams and information management solutions is also a prerequisite.
Along with a thorough understanding of the various Big Data tools and technologies, you will need strong communication skills and a deep understanding of the business. If you want to know more about how you can become a Chief Data Officer, you can browse through our piece on the role of a CDO.

Why demand for data science professionals will rise

It's hard to imagine an organization which doesn't have to deal with data, but it's harder to imagine the state of an organization with petabytes of data and not knowing what to do with it. With the vast amounts of data organizations deal with these days, the need for experts who know how to handle the data and derive relevant and timely insights from it is higher than ever. In fact, IBM predicts there's going to be a severe shortage of data science professionals, and thereby a tremendous growth in job offers and advertised openings, by 2020. Not everyone is equipped with the technical skills and know-how associated with tasks such as data mining, machine learning, and more. This is slowly creating a massive void in terms of talent that organizations are looking to fill quickly, by offering lucrative salaries and added benefits. Without the professional expertise to turn data into actionable insights, Big Data becomes all but useless.

How to build a Scatterplot in IBM SPSS

Kartikey Pandey
27 Nov 2017
4 min read
[box type="note" align="" class="" width=""] ----The following excerpt is from the title Data Analysis with IBM SPSS Statistics, Chapter 5, written by Kenneth Stehlik-Barry and Anthony J. Babinec. Analytical tools such as SPSS can readily provide even a novice user with an overwhelming amount of information and a broad range of options for analyzing patterns in the data. [/box] In this article we help you learn the techniques of SPSS to build Scatterplot using the Chart Builder feature. One of the most valuable methods for examining the relationship between two variables containing scale-level data is a scatterplot. In the previous chapter, scatterplots were used to detect points that deviated from the typical pattern--multivariate outliers. To produce a similar scatterplot using two fields from the 2016 General Social Survey data, navigate to Graphs | Chart Builder. An information box is displayed indicating that each field's measurement properties will be used to identify the types of graphs available so adjusting these properties is advisable. In this example, the properties will be modified as a part of the graph specification process but you may want to alter the properties of some variables permanently so that they don't need to be changed for each use. For now, just select OK to move ahead. In the main Chart Builder window, select Scatter/Dot from the menu at the lower left, double-click on the first graph to the right (Simple Scatter) to place it in the preview pane at the upper right, and then right-click on the first field labeled HIGHEST YEAR OF SCHOOL. Change this variable from Nominal to Scale, as shown in the following Screenshot: After changing the respondent's education to Scale, drag this field to the X-Axis location in the preview pane and drag spouse's education to the Y-Axis location. Once both elements are in place, the OK choice will become available. Select it to produce the scatterplot in the following screenshot: The scatterplot produced by default provides some sense of the trend in that the denser circles are concentrated in a band from the lower left to the upper right. This pattern, however, is rather subtle visually. With some editing, the relationship can be made more Evident. Double-click on the graph to open the Chart Editor and select the X icon at the top and change the major increment to 4 so that there are numbers corresponding to completing high school and college. Do the same for the y-axis values. Select a point on the graph to highlight all the "dots" and right-click to display the following dialog. Click on the Marker tab and change the symbol to the star shape, increase the size to 6, increase the border to 2, and change the border color to a dark blue. Use Apply to make the changes visible on the scatterplot: Use the Add Fit line at Total icon above the graph to show the regression line for this data. Drag the R2 box from the upper right to the bottom, below the graph and drag the box on the graphs with the equation displayed to the lower left away from the points: The modifications to the original scatterplot make it easier to see the pattern since the “stars” near the line are darker and denser than those farther from the line indicating fewer cases are associated with those points. The SPSS capabilities with respect to scatterplot in this article will give you a foundation to create a visual representation of data for both deeper pattern discovery and to communicate results to a broader audience. 
Several other graph types such as pie charts and multiple line charts  can be built and edited using the approach shown in Chapter 5, Visually Exploring the Data from our title Data Analysis with IBM SPSS Statistics. Go on, explore these alternative graph styles to see when they may be better suited to your needs.  

A mid-autumn Shopper’s dream - What an Amazon fulfilled Thanksgiving would look like

Aaron Lazar
24 Nov 2017
10 min read
I’d been preparing for Thanksgiving a good 3 weeks in advance. One reason is that I’d recently rented out a new apartment and the troops were heading over to my place this year. I obviously had to make sure everything went well and for that, trust me, there was no resting even for a minute! Thanksgiving is really about being thankful for the people and things in your life and spending quality time with family. This Thanksgiving I’m especially grateful to Amazon for making it the best experience ever! Read on to find out how Amazon made things awesome! Good times started two weeks ago when I was at the AmazonGo store with my friend, Sue. [embed]https://www.youtube.com/watch?v=NrmMk1Myrxc[/embed] In fact, this was the first time I had set foot in one of the stores. I wanted to see what was so cool about them and why everyone had been talking about them for so long! The store was pretty big and lived up to the A to Z concept, as far as I could see. The only odd thing was that I didn’t notice any queues or a billing counter. Sue glided around the floor with ease, as if she did this every day. I was more interested in seeing what was so special about this place. After she got her stuff, she headed straight for the door. I smiled to myself thinking how absent minded she was. So I called her back and reminded her “You haven’t gotten your products billed.” She smiled back at me and shrugged, “I don’t need to.” Before I could open my mouth to tell her off for stealing, she explained to me about the store. It’s something totally futuristic! Have you ever imagined not having to stand in a line to buy groceries? At the store, you just had to log in to your AmazonGo app on your phone, enter the store, grab your stuff and then leave. The sensors installed everywhere in the store automatically detected what you’d picked up and would bill you accordingly. They also used Computer Vision and Deep Learning to track people and their shopping carts. Now that’s something! And you even got a receipt! Well, it was my birthday last week and knowing what an avid reader I was, my colleagues from office gifted me a brand new Kindle. I loved every bit of it, but the best part was the X-ray feature. With X-ray, you could simply get information about a character, person or terms in a book. You could also scroll through the long lists of excerpts and click on one to go directly to that particular portion of the book! That’s really amazing, especially if you want to read a particular part of the book quickly. It came in use at the right time - I downloaded a load of recipe books for the turkey. Another feather in the cap for Amazon! Talking about feathers in one’s cap, you won’t believe it, but Amazon actually got me rekognised at work a few days ago. Nah, that wasn’t a typo. I worked as a software developer/ML engineer in a startup and I’d been doing this for as long as I can remember. I recently built this cool mobile application that recognized faces and unlocked your phone even when you didn’t have something like Face ID on your phone and the app had gotten us a million downloads in a month! It could also recognize and give you information about the ethnicity of a person if you captured their photograph with the phone’s camera. The trick was that I’d used the AmazonRekognition APIs for enhanced face detection in the application. Rekognition allows you to detect objects, scenes, text, and faces, using highly scalable, deep learning models. I also enhanced the application using the Polly API. 
Polly converts text to whichever language you want the speech in and gives you the synthesized speech in the form of audio files.The app I built now converted input text into 18 different languages, helping one converse with the person in front of them in that particular language, should they have a problem doing it in English. I got that long awaited promotion right after! Ever wondered how I got the new apartment? ;) Since the folks were coming over to my place in a few days, I thought I’d get a new dinner set. You’d probably think I would need to sit down at my PC or probably pick up my phone to search for a set online, but I had better things to do. Thanks to Alexa, I simply needed to ask her to find one for me and she did it brilliantly. Now Alexa isn’t my girlfriend, although I would have loved that to be. Alexa is actually Amazon’s cloud-based voice service that provides customers with an engaging way of interacting with technology. Alexa is blessed with finely tuned ASR or Automatic Speech Recognition and NLU or Natural Language Understanding engines, that instantly recognize and respond to voice requests. I selected a pretty looking set and instantly bought it through my Prime account. With technology like this at my fingertips, the developer in me had left no time in exploring possibilities with Alexa. That’s when I found out about Lex, built on the same deep learning platform that Alexa works on, which allows developers to build conversational interfaces into their apps. With the dinner set out of the way, I sat back with my feet up on the table. I was awesome, baby! Oh crap! I forgot to buy the turkey, the potatoes, the wine and a whole load of other stuff. It was 3 AM and I started panicking. I remembered that mum always put the turkey in the fridge at least 3 days in advance. I had only 2! I didn’t even have the time to make it to the AmazonGo store. I was panicking again and called up Suzy to ask her if she could pick up the stuff for me. She sounded so calm over the phone when I narrated my horror to her. She simply told me to get the stuff from AmazonFresh. So I hastily disconnected the call and almost screamed to Alexa, “Alexa, find me a big bird!”, and before I realized what I had said, I was presented with this. [caption id="attachment_2215" align="aligncenter" width="184"] Big Bird is one of the main protagonist in Sesame Street.[/caption] So I tried again, this time specifying what I actually needed! With AmazonDash integrating with AmazonFresh, I was able to get the turkey and other groceries delivered home in no time! What a blessing, indeed! A day before Thanksgiving, I was stuck in the office, working late on a new project. We usually tinkered around with a lot of ML and AI stuff. There was this project which needed the team to come up with a really innovative algorithm to perform a deep learning task. As the project lead, I was responsible for choosing the tech stack and I’m glad a little birdie had recently told me about AWS taking in MXNet officially as a Deep Learning Framework. MXNet made it a breeze to build ML applications that train quickly and could run anywhere. Moreover, with the recent collaboration between Amazon and Microsoft, a new ML library called Gluon was born. Available in MXNet, Gluon made building ML models, even more, easier and quicker, without compromising on performance. Need I say the project was successful? I got home that evening and sat down to pick a good flick or two to download from Amazon PrimeVideo. 
There’s always someone in the family who’d suggest we all watch a movie and I had to be prepared. With that done I quickly showered and got to bed. It was going to be a long day the next day! 4 AM my alarm rang and I was up! It was Thanksgiving, and what a wonderful day it was! I quickly got ready and prepared to start cooking. I got the bird out of the freezer and started to thaw it in cold water. It was a big bird so it was going to take some time. In the meantime, I cleaned up the house and then started working on the dressing. Apples, sausages, and cranberry. Yum! As I sliced up the sausages I realized that I had misjudged the quantity. I needed to get a couple more packets immediately! I had to run to the grocery store right away or there would be a disaster! But it took me a few minutes to remember it was Thanksgiving, one of the craziest days to get out on the road. I could call the store delivery guy or probably Amazon Dash, but then that would be illogical cos he’d have to take the same congested roads to get home.  I turned to Alexa for help, “Alexa, how do I get sausages delivered home in the next 30 minutes?”. And there I got my answer - Try Amazon PrimeAir. Now I don’t know about you, but having a drone deliver a couple packs of sausages to my house, is nothing less than ecstatic! I sat it out near the window for the next 20 minutes, praying that the package wouldn’t be intercepted by some hungry birds! I couldn’t miss the sight of the pork flying towards my apartment. With the dressing and turkey baked and ready, things were shaping up much better than I had expected. The folks started rolling in by lunchtime. Mum and dad were both quite impressed with the way I had organized things. I was beaming and in my mind hi-fived Amazon for helping me make everything possible with its amazing products and services designed to delight customers. It truly lives up to its slogan: Work hard. Have fun. Make history. If you are one of those folks who do this every day, behind the scenes, by building amazing products powered by machine learning and big data to make other's lives better, I want to thank you today for all your hard work. This Thanksgiving weekend, Packt's offering an unbelievable deal - Buy any book or video for just $10 or any three for $25! I know what I have my eyes on! Python Machine Learning - Second Edition by Sebastian Raschka and Vahid Mirjalili Effective Amazon Machine Learning by Alexis Perrier OpenCV 3 - Advanced Image Detection and Reconstruction [Video] by Prof. Robert Laganiere In the end, there’s nothing better than spending quality time with your family, enjoying a sumptuous meal, watching smiles all around and just being thankful for all you have. All I could say was, this Thanksgiving was truly Amazon fulfilled! :) Happy Thanksgiving folks!    

How to create 3D Graphics and Animation in R

Savia Lobo
23 Nov 2017
4 min read
[box type="note" align="alignright" class="" width=""] The article is originally extracted from our book R Data Analysis Cookbook - Second Edition by Kuntal Ganguly. Data analytics with R has emerged to be a very important focus for organizations of all kind. The book will show how you can put your data analysis skills in R to practical use. It contains recipes catering to basic as well as advanced data analysis tasks. [/box] In this article, we have captured one aspect of using R for the creation of 3D graphics and animation. When, a two-dimensional view is not sufficient to understand and analyze the data; besides the x, y variable, an additional data dimension can be represented by a color variable. Our article gives a step by step explanation on how to plot 3D package in R is used to visualize three-dimensional graphs. Codes and data files are readily available for download towards the end of the post. Get set ready Make sure you have downloaded the code and the mtcars.csv file is located in the working directory of R. Install and load the latest version of the plot3D package: > install.packages("plot3D") > library(plot3D) How to do it... Load the mtcars dataset and preprocess it to add row names and remove the model name column: > mtcars=read.csv("mtcars.csv") > rownames(mtcars) <- mtcars$X > mtcars$X=NULL > head(mtcars) 2. Next, create a three-dimensional scatter plot: > scatter3D(x=mtcars$wt, y=mtcars$disp, z=mtcars$mpg, clab = c("Miles/(US) gallon")) 3. Now, add title and axis labels to the scatter plot: > scatter3D(x=mtcars$wt, y=mtcars$disp, z=mtcars$mpg, pch = 18, theta = 20, phi = 20, main = "Motor Trend Car Road Tests", xlab = "Weight lbs",ylab ="Displacement (cu.in.)", zlab = "Miles gallon") Rotating a 3D plot can provide a complete view of the data. Now view the plot in different directions by altering the values of two attributes--theta and phi: > scatter3D(x=mtcars$wt, y=mtcars$disp, z=mtcars$mpg,clab = c("Cars Mileage"),theta = 15, phi = 0, bty ="g") How it works... The scatter3D() function from the plot3D package has the following parameters: x, y, z: Vectors of point coordinates colvar: A variable used for the coloring col: A color palette to color the colvar variable labels: Refers to the text to be written add: Logical; if TRUE, then the points will be added to the current plot, and if FALSE, a new plot is started pch: Shape of the points cex: Size of the points theta: The azimuthal direction phi: Co-latitude; both theta and phi can be used to define the angles for the viewing direction Bty: Refers to the type of enclosing box and can take various values such as f-full box, b-default value (back panels only), g- grey background with white grid lines, and bl- black background There's more… The following concepts are very important for this recipe. Adding text to an existing 3D plot We can use the text3D() function to add text based on the car model name, alongside the data points: > scatter3D(x=mtcars$wt, y=mtcars$disp, z=mtcars$mpg, phi = 0, bty = "g", pch = 20, cex = 0.5) > text3D(x=mtcars$wt, y=mtcars$disp, z=mtcars$mpg, labels = rownames(mtcars), add = TRUE, colkey = FALSE, cex = 0.5) Using a 3D histogram The three-dimensional histogram function, hist3D(), has the following attributes: z: Values contained within a matrix. x, y: Vectors, where the length of x should be equal to nrow(z) and length of y should be equal to ncol(z). colvar: Variable used for the coloring and has the same dimension as z. col: Color palette used for the colvar variable. 
By default, a red-yellow-blue color scheme. add: Logical variable. If TRUE, adds surfaces to the current plot. If FALSE, starts a new plot. Let's plot the Death rate of Virgina using a 3D histogram: data(VADeaths) hist3D(z = VADeaths, scale = FALSE, expand = 0.01, bty = "g", phi = 20, col = "#0085C2", border = "black", shade = 0.2, ltheta = 80, space = 0.3, ticktype = "detailed", d = 2) Using a line graph To visualize the plot with a line graph, add the type parameter to the scatter3D() function. The type parameter can take the values l(only line), b(both line and point), and h(horizontal line and points both). The 3D plot with a horizontal line and plot: > scatter3D(x=mtcars$wt, y=mtcars$disp, z=mtcars$mpg,type="h", clab = c("Miles/(US) gallon")) [box type="download" align="" class="" width=""]Download code files here. [/box] If you think the recipe is delectable, you should definitely take a look at R data Analysis Cookbook- Second edition which will enlighten you more on data visualizations with R.  

Training RNNs for Time Series Forecasting

Aaron Lazar
22 Nov 2017
7 min read
[box type="note" align="" class="" width=""]This tutorial has been taken from the book Practical Time Series Analysis by Dr. PKS Prakash and Dr. Avishek Pal.[/box] RNNs are notoriously difficult to be trained. Vanilla RNNs suffer from vanishing and exploding gradients that give erratic results during training. As a result, RNNs have difficulty in learning long-range dependencies. For time series forecasting, going too many timesteps back in the past would be problematic. To address this problem, Long Short Term Memory (LSTM) and Gated Recurrent Unit (GRU), which are special types of RNNs, have been introduced. We will use LSTM and GRU to develop the time series forecasting models. Before that, let's review how RNNs are trained using Backpropagation Through Time (BPTT), a variant of the backpropagation algorithm. We will find out how vanishing and exploding gradients arise during BPTT. Let's consider the computational graph of the RNN that we have been using for the time series forecasting. The gradient computation is shown in the following figure: Figure: Back-Propagation Through Time for deep recurrent neural network with p timesteps For weights U, there is one path to compute the partial derivative, which is given as follows: However, due to the sequential structure of the RNN, there are multiple paths connecting the weights and the loss and the loss. Hence, the partial derivative is the sum of partial derivatives along the individual paths that start at the loss node and ends at every timestep node in the computational graph: The technique of computing the gradients for the weights by summing over paths connecting the loss node and every timestep node is backpropagation through time, which is a special case of the original backpropagation algorithm. The problem of vanishing gradients in long-range RNNs is due to the multiplicative terms in the BPTT gradient computations. Now let's examine a multiplicative term from one of the preceding equations. The gradients along the computation path connecting the loss node and ith timestep is This chain of gradient multiplication is ominously long to model long-range dependencies and this is where the problem of vanishing gradient arises. The activation function for the internal state [0,1) is either a tanh or sigmoid. The first derivative of tanh is which is bound in (0,¼]. In the sigmoid function, the first-order derivative is which is bound in si . Hence, the gradients W  are positive fractions. For long-range timesteps, multiplying these fractional gradients diminishes the final product to zero and there is no gradient flow from a long-range timestep. Due to the negligibly low values of the gradients, the weights do not update and hence the neurons are said to be saturated. It is noteworthy that U, ∂si/∂s(i-1), and ∂si/∂W are matrices and therefore the partial derivatives t and st are computed on matrices. The final output is computed through matrix multiplications and additions. The first derivative on matrices is called Jacobian. If any one element of the Jacobian matrix is a fraction, then for a long-range RNN, we would see a vanishing gradient. On the other hand, if an element of the Jacobian is greater than one, the training process suffers from exploding gradients. Solving the long-range dependency problem We have seen in the previous section that it is difficult for vanilla RNNs to effectively learn long-range dependencies due to vanishing and exploding gradients. 
To address this issue, the Long Short Term Memory network was developed in 1997 by Sepp Hochreiter and Jürgen Schmidhuber. The Gated Recurrent Unit was introduced in 2014 and is a simpler version of LSTM. Let's review how LSTM and GRU solve the problem of learning long-range dependencies.

Long Short Term Memory (LSTM)

LSTM introduces additional computations in each timestep. However, it can still be treated as a black box unit that, for timestep t, returns an internal state st, and this is forwarded to the next timestep. Internally, however, these vectors are computed differently. LSTM introduces three new gates: the input (it), forget (ft), and output (ot) gates. Every timestep also has an internal hidden state (gt) and an internal memory (ct). These new units are computed as follows:

it = σ(Wi xt + Ui st-1)
ft = σ(Wf xt + Uf st-1)
ot = σ(Wo xt + Uo st-1)
gt = tanh(Wg xt + Ug st-1)
ct = ct-1 ∘ ft + gt ∘ it
st = tanh(ct) ∘ ot

Now, let's understand these computations. The gates are generated through sigmoid activations that limit their values to the range (0, 1). Hence, they act as gates by letting out only a fraction of a value when multiplied by another variable. The input gate it controls the fraction of the newly computed input to keep. The forget gate ft determines the effect of the previous timestep, and the output gate ot controls how much of the internal state to let out. The internal hidden state gt is calculated from the input to the current timestep and the output of the previous timestep. Note that this is the same as computing the internal state in a vanilla RNN. ct is the internal memory unit of the current timestep; it combines the memory from the previous step, downsized by the fraction given by the forget gate ft, with the internal hidden state gt mixed with the input gate it. Finally, st is passed to the next timestep and is computed from the current internal memory ct and the output gate ot. The input, forget, and output gates are used to selectively include the previous memory and the current hidden state, which is computed in the same manner as in vanilla RNNs. The gating mechanism of LSTM allows memory transfer over long-range timesteps.

Gated Recurrent Units (GRU)

GRU is simpler than LSTM and has only two internal gates, namely, the update gate (zt) and the reset gate (rt). The computations of the update and reset gates are as follows:

zt = σ(Wz xt + Uz st-1)
rt = σ(Wr xt + Ur st-1)

The state st of the timestep t is computed using the input xt, the state st-1 from the previous timestep, the update gate, and the reset gate:

ht = tanh(W xt + U(st-1 ∘ rt))
st = (1 - zt) ∘ ht + zt ∘ st-1

The update gate, being computed by a sigmoid function, determines how much of the previous step's memory is to be retained in the current timestep. The reset gate controls how to combine the previous memory with the current step's input.

Compared to LSTM, which has three gates, GRU has two. It does not have the output gate or the internal memory, both of which are present in LSTM. The update gate in GRU determines how to combine the previous memory with the current memory, and combines the functionality achieved by the input and forget gates of LSTM. The reset gate, which combines the effect of the previous memory and the current input, is applied directly to the previous memory. Despite a few differences in how memory is transmitted along the sequence, the gating mechanisms in both LSTM and GRU are meant to learn long-range dependencies in data.

So, which one do we use - LSTM or GRU?

Both LSTM and GRU are capable of handling memory over long RNNs. However, a common question is which one to use?
LSTM has long been preferred as the first choice for language models, which is evident from its extensive use in language translation, text generation, and sentiment classification. GRU has the distinct advantage of fewer trainable weights as compared to LSTM, and it has been applied to tasks where LSTM previously dominated. However, empirical studies show that neither approach outperforms the other in all tasks. Tuning model hyperparameters, such as the dimensionality of the hidden units, improves the predictions of both. A common rule of thumb is to use GRU in cases with less training data, as it requires fewer trainable weights. LSTM proves to be effective in the case of large datasets, such as the ones used to develop language translation models.

This article is an excerpt from the book Practical Time Series Analysis. For more techniques, grab the book now!
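To make the LSTM-versus-GRU comparison concrete, here is a minimal sketch of how both recurrent cells could be dropped into the same one-step-ahead forecaster using the Keras API shipped with TensorFlow (tf.keras). This is not the book's code: the toy sine-wave data, the window length, and the layer sizes are illustrative assumptions.

```python
# Minimal sketch: swapping LSTM for GRU in a one-step-ahead forecaster (tf.keras).
# All hyperparameters below are illustrative, not taken from the book.
import numpy as np
import tensorflow as tf

def build_forecaster(cell="lstm", window_len=20, units=32):
    layer = tf.keras.layers.LSTM if cell == "lstm" else tf.keras.layers.GRU
    model = tf.keras.Sequential([
        layer(units, input_shape=(window_len, 1)),  # recurrent layer: LSTM or GRU
        tf.keras.layers.Dense(1),                   # predict the next value
    ])
    model.compile(optimizer="adam", loss="mse")
    return model

# Toy data: sliding windows over a noisy sine wave, shaped (samples, window, 1).
series = np.sin(np.linspace(0, 60, 2000)) + 0.1 * np.random.randn(2000)
window_len = 20
X = np.stack([series[i:i + window_len] for i in range(len(series) - window_len)])[..., None]
y = series[window_len:]

for cell in ("lstm", "gru"):
    model = build_forecaster(cell, window_len)
    model.fit(X, y, epochs=2, batch_size=64, verbose=0)
    # GRU ends up with fewer trainable weights than LSTM for the same number of units.
    print(cell, "parameters:", model.count_params())
```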

Implementing the k-nearest neighbors algorithm in Python

Aaron Lazar
17 Nov 2017
7 min read
[box type="note" align="" class="" width=""]The following is an excerpt from Dávid Natingga's Data Science Algorithms in a Week. [/box] The nearest neighbor algorithm classifies a data instance based on its neighbors. The class of a data instance determined by the k-nearest neighbor algorithm is the class with the highest representation among the k-closest neighbors. In this short tutorial, we will cover the basics of the k-NN algorithm - understanding it and its implementation with a simple example: Mary and her temperature preferences. So let’s get right to it, shall we? Mary and her temperature preferences As an example, if we know that our friend Mary feels cold when it is 10 degrees Celsius, but warm when it is 25 degrees Celsius, then in a room where it is 22 degrees Celsius, the nearest neighbor algorithm would guess that our friend would feel warm, because 22 is closer to 25 than to 10. Suppose we would like to know when Mary feels warm and when she feels cold, as in the previous example, but in addition, wind speed data is also available when Mary was asked if she felt warm or cold: We could represent the data in a graph, as follows: Now, suppose we would like to find out how Mary feels at the temperature 16 degrees Celsius with a wind speed of 3km/h using the 1-NN algorithm: For simplicity, we will use a Manhattan metric to measure the distance between the neighbors on the grid. The Manhattan distance dMan of a neighbor N1=(x1,y1) from the neighbor N2=(x2,y2) is defined to be dMan=|x1- x2|+|y1- y2|. Let us label the grid with distances around the neighbors to see which neighbor with a known class is closest to the point we would like to classify: We can see that the closest neighbor with a known class is the one with the temperature 15 (blue) degrees Celsius and the wind speed 5km/h. Its distance from the questioned point is three units. Its class is blue (cold). The closest red (warm) neighbor is away four units from the questioned point. Since we are using the 1-nearest neighbor algorithm, we just look at the closest neighbor and, therefore, the class of the questioned point should be blue (cold). By applying this procedure to every data point, we can complete the graph as follows: Note that sometimes a data point can be distanced from two known classes with the same distance: for example, 20 degrees Celsius and 6km/h. In such situations, we would prefer one class over the other or ignore these boundary cases. The actual result depends on the specific implementation of an algorithm. Implementation of k-nearest neighbors algorithm in Python We’ll implement the k-NN algorithm in Python to find Mary's temperature preference: # source_code/1/mary_and_temperature_preferences/knn_to_data.py # Applies the knn algorithm to the input data. # The input text file is assumed to be of the format with one line per # every data entry consisting of the temperature in degrees Celsius, # wind speed and then the classification cold/warm. import sys sys.path.append('..') sys.path.append('../../common') import knn # noqa import common # noqa # Program start # E.g. "mary_and_temperature_preferences.data" input_file = sys.argv[1] # E.g. 
"mary_and_temperature_preferences_completed.data" output_file = sys.argv[2] k = int(sys.argv[3]) x_from = int(sys.argv[4]) x_to = int(sys.argv[5]) y_from = int(sys.argv[6]) y_to = int(sys.argv[7]) data = common.load_3row_data_to_dic(input_file) new_data = knn.knn_to_2d_data(data, x_from, x_to, y_from, y_to, k) common.save_3row_data_from_dic(output_file, new_data) # source_code/common/common.py # ***Library with common routines and functions*** def dic_inc(dic, key): if key is None: Pass if dic.get(key, None) is None: dic[key] = 1 Else: dic[key] = dic[key] + 1 # source_code/1/knn.py # ***Library implementing knn algorihtm*** def info_reset(info): info['nbhd_count'] = 0 info['class_count'] = {} # Find the class of a neighbor with the coordinates x,y. # If the class is known count that neighbor. def info_add(info, data, x, y): group = data.get((x, y), None) common.dic_inc(info['class_count'], group) info['nbhd_count'] += int(group is not None) # Apply knn algorithm to the 2d data using the k-nearest neighbors with # the Manhattan distance. # The dictionary data comes in the form with keys being 2d coordinates # and the values being the class. # x,y are integer coordinates for the 2d data with the range # [x_from,x_to] x [y_from,y_to]. def knn_to_2d_data(data, x_from, x_to, y_from, y_to, k): new_data = {} info = {} # Go through every point in an integer coordinate system. for y in range(y_from, y_to + 1): for x in range(x_from, x_to + 1): info_reset(info) # Count the number of neighbors for each class group for # every distance dist starting at 0 until at least k # neighbors with known classes are found. for dist in range(0, x_to - x_from + y_to - y_from): # Count all neighbors that are distanced dist from # the point [x,y]. if dist == 0: info_add(info, data, x, y) Else: for i in range(0, dist + 1): info_add(info, data, x - i, y + dist - i) info_add(info, data, x + dist - i, y - i) for i in range(1, dist): info_add(info, data, x + i, y + dist - i) info_add(info, data, x - dist + i, y - i) # There could be more than k-closest neighbors if the # distance of more of them is the same from the point # [x,y]. But immediately when we have at least k of # them, we break from the loop. if info['nbhd_count'] >= k: Break class_max_count = None # Choose the class with the highest count of the neighbors # from among the k-closest neighbors. for group, count in info['class_count'].items(): if group is not None and (class_max_count is None or count > info['class_count'][class_max_count]): class_max_count = group new_data[x, y] = class_max_count return new_data Input: The program above will use the file below as the source of the input data. The file contains the table with the known data about Mary's temperature preferences: # source_code/1/mary_and_temperature_preferences/ marry_and_temperature_preferences.data 10 0 cold 25 0 warm 15 5 cold 20 3 warm 18 7 cold 20 10 cold 22 5 warm 24 6 warm Output: We run the implementation above on the input file mary_and_temperature_preferences.data using the k-NN algorithm for k=1 neighbors. The algorithm classifies all the points with the integer coordinates in the rectangle with a size of (30-5=25) by (10-0=10), so with the a of (25+1) * (10+1) = 286 integer points (adding one to count points on boundaries). Using the wc command, we find out that the output file contains exactly 286 lines - one data item per point. Using the head command, we display the first 10 lines from the output file. 
We visualize all the data from the output file in the next section:

$ python knn_to_data.py mary_and_temperature_preferences.data mary_and_temperature_preferences_completed.data 1 5 30 0 10
$ wc -l mary_and_temperature_preferences_completed.data
286 mary_and_temperature_preferences_completed.data
$ head -10 mary_and_temperature_preferences_completed.data
7 3 cold
6 9 cold
12 1 cold
16 6 cold
16 9 cold
14 4 cold
13 4 cold
19 4 warm
18 4 cold
15 1 cold

So, there you have it! The k-nearest neighbors algorithm explained and implemented in Python. I hope you enjoyed this tutorial and found it interesting. If you want more, go ahead and purchase Dávid Natingga's Data Science Algorithms in a Week, from which the tutorial has been extracted.
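As a quick cross-check outside the book's own implementation, the same 1-NN classification of Mary's preferences can be sketched with scikit-learn (this snippet is an addition to the excerpt and assumes scikit-learn is installed; the data points and the Manhattan metric are the ones used above):

from sklearn.neighbors import KNeighborsClassifier

# (temperature in degrees Celsius, wind speed in km/h) -> how Mary felt, from the data file above
X = [[10, 0], [25, 0], [15, 5], [20, 3], [18, 7], [20, 10], [22, 5], [24, 6]]
y = ["cold", "warm", "cold", "warm", "cold", "cold", "warm", "warm"]

clf = KNeighborsClassifier(n_neighbors=1, metric="manhattan")
clf.fit(X, y)
print(clf.predict([[16, 3]]))  # should agree with the hand-worked answer above ("cold")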


Visualizing 3D plots in Matplotlib 2.0

Sugandha Lahoti
16 Nov 2017
7 min read
[box type="note" align="" class="" width=""]This article is an excerpt from a book by Allen Chi Shing Yu, Claire Yik Lok Chung, and Aldrin Kay Yuen Yim titled Matplotlib 2.x By Example.[/box] By transitioning to the three-dimensional space, you may enjoy greater creative freedom when creating visualizations. The extra dimension can also accommodate more information in a single plot. However, some may argue that 3D is nothing more than a visual gimmick when projected to a 2D surface (such as paper) as it would obfuscate the interpretation of data points. In Matplotlib version 2, despite significant developments in the 3D API, annoying bugs or glitches still exist. We will discuss some workarounds toward the end of this article. More powerful Python 3D visualization packages do exist (such as MayaVi2, Plotly, and VisPy), but it's good to use Matplotlib's 3D plotting functions if you want to use the same package for both 2D and 3D plots, or you would like to maintain the aesthetics of its 2D plots. For the most part, 3D plots in Matplotlib have similar structures to 2D plots. As such, we will not go through every 3D plot type in this section. We will put our focus on 3D scatter plots and bar charts. 3D scatter plot Let's try to create a 3D scatter plot. Before doing that, we need some data points in three dimensions (x, y, z): import pandas as pd source = "https://raw.githubusercontent.com/PointCloudLibrary/data/master/tutorials/ ism_train_cat.pcd" cat_df = pd.read_csv(source, skiprows=11, delimiter=" ", names=["x","y","z"], encoding='latin_1') cat_df.head() To declare a 3D plot, we first need to import the Axes3D object from the mplot3d extension in mpl_toolkits, which is responsible for rendering 3D plots in a 2D plane. After that, we need to specify projection='3d' when we create subplots: from mpl_toolkits.mplot3d import Axes3D import matplotlib.pyplot as plt fig = plt.figure() ax = fig.add_subplot(111, projection='3d') ax.scatter(cat_df.x, cat_df.y, cat_df.z) plt.show() Behold, the mighty sCATter plot in 3D. Cats are currently taking over the internet. According to the New York Times, cats are "the essential building block of the Internet" (https://www.nytimes.com/2014/07/23/upshot/what-the-internet-can-see-from-your-cat-pictures.html). Undoubtedly, they deserve a place in this chapter as well. Contrary to the 2D version of scatter(), we need to provide X, Y, and Z coordinates when we are creating a 3D scatter plot. Yet the parameters that are supported in 2D scatter() can be applied to 3D scatter() as well: fig = plt.figure() ax = fig.add_subplot(111, projection='3d') # Change the size, shape and color of markers ax.scatter(cat_df.x, cat_df.y, cat_df.z, s=4, c="g", marker="o") plt.show() To change the viewing angle and elevation of the 3D plot, we can make use of view_init(). The azim parameter specifies the azimuth angle in the X-Y plane, while elev specifies the elevation angle. When the azimuth angle is 0, the X-Y plane would appear to the north from you. Meanwhile, an azimuth angle of 180 would show you the south side of the X-Y plane: fig = plt.figure() ax = fig.add_subplot(111, projection='3d') ax.scatter(cat_df.x, cat_df.y, cat_df.z,s=4, c="g", marker="o") # elev stores the elevation angle in the z plane azim stores the # azimuth angle in the x,y plane ax.view_init(azim=180, elev=10) plt.show() 3D bar chart We introduced candlestick plots for showing Open-High-Low-Close (OHLC) financial data. In addition, a 3D bar chart can be employed to show OHLC across time. 
The next figure shows a typical example of plotting a 5-day OHLC bar chart: import matplotlib.pyplot as plt import numpy as np from mpl_toolkits.mplot3d import Axes3D # Get 1 and every fifth row for the 5-day AAPL OHLC data ohlc_5d = stock_df[stock_df["Company"]=="AAPL"].iloc[1::5, :] fig = plt.figure() ax = fig.add_subplot(111, projection='3d') # Create one color-coded bar chart for Open, High, Low and Close prices. for color, col, z in zip(['r', 'g', 'b', 'y'], ["Open", "High", "Low", "Close"], [30, 20, 10, 0]): xs = np.arange(ohlc_5d.shape[0]) ys = ohlc_5d[col] # Assign color to the bars colors = [color] * len(xs) ax.bar(xs, ys, zs=z, zdir='y', color=colors, alpha=0.8, width=5) plt.show() The method for setting ticks and labels is similar to other Matplotlib plotting functions: fig = plt.figure(figsize=(9,7)) ax = fig.add_subplot(111, projection='3d') # Create one color-coded bar chart for Open, High, Low and Close prices. for color, col, z in zip(['r', 'g', 'b', 'y'], ["Open", "High", "Low", "Close"], [30, 20, 10, 0]): xs = np.arange(ohlc_5d.shape[0]) ys = ohlc_5d[col] # Assign color to the bars colors = [color] * len(xs) ax.bar(xs, ys, zs=z, zdir='y', color=colors, alpha=0.8) # Manually assign the ticks and tick labels ax.set_xticks(np.arange(ohlc_5d.shape[0])) ax.set_xticklabels(ohlc_5d["Date"], rotation=20, verticalalignment='baseline', horizontalalignment='right', fontsize='8') ax.set_yticks([30, 20, 10, 0]) ax.set_yticklabels(["Open", "High", "Low", "Close"]) # Set the z-axis label ax.set_zlabel('Price (US $)') # Rotate the viewport ax.view_init(azim=-42, elev=31) plt.tight_layout() plt.show() Caveats to consider while visualizing 3D plots in Matplotlib Due to the lack of a true 3D graphical rendering backend (such as OpenGL) and proper algorithm for detecting 3D objects' intersections, the 3D plotting capabilities of Matplotlib are not great but just adequate for typical applications. In the official Matplotlib FAQ (https://matplotlib.org/mpl_toolkits/mplot3d/faq.html), the author noted that 3D plots may not look right at certain angles. Besides, we also reported that mplot3d would failed to clip bar charts if zlim is set (https://github.com/matplotlib/matplotlib/ issues/8902; see also https://github.com/matplotlib/matplotlib/issues/209). Without improvements in the 3D rendering backend, these issues are hard to fix. To better illustrate the latter issue, let's try to add ax.set_zlim3d(bottom=110, top=150) right above plt.tight_layout() in the previous 3D bar chart: Clearly, something is going wrong, as the bars overshoot the lower boundary of the axes. We will try to address the latter issue through the following workaround: # FuncFormatter to add 110 to the tick labels def major_formatter(x, pos): return "{}".format(x+110) fig = plt.figure(figsize=(9,7)) ax = fig.add_subplot(111, projection='3d') # Create one color-coded bar chart for Open, High, Low and Close prices. 
for color, col, z in zip(['r', 'g', 'b', 'y'], ["Open", "High", "Low", "Close"], [30, 20, 10, 0]): xs = np.arange(ohlc_5d.shape[0]) ys = ohlc_5d[col] # Assign color to the bars colors = [color] * len(xs) # Truncate the y-values by 110 ax.bar(xs, ys-110, zs=z, zdir='y', color=colors, alpha=0.8) # Manually assign the ticks and tick labels ax.set_xticks(np.arange(ohlc_5d.shape[0])) ax.set_xticklabels(ohlc_5d["Date"], rotation=20, verticalalignment='baseline', horizontalalignment='right', fontsize='8') # Set the z-axis label ax.set_yticks([30, 20, 10, 0]) ax.set_yticklabels(["Open", "High", "Low", "Close"]) ax.zaxis.set_major_formatter(FuncFormatter(major_formatter)) ax.set_zlabel('Price (US $)') # Rotate the viewport ax.view_init(azim=-42, elev=31) plt.tight_layout() plt.show() Basically, we truncated the y values by 110, and then we used a tick formatter (major_formatter) to shift the tick value back to the original. For 3D scatter plots, we can simply remove the data points that exceed the boundary of set_zlim3d() in order to generate a proper figure. However, these workarounds may not work for every 3D plot type. Conclusion We didn't go into too much detail of the 3D plotting capability of Matplotlib, as it is yet to be polished. For simple 3D plots, Matplotlib already suffices. The learning curve can be reduced if we use the same package for both 2D and 3D plots. You are advised to take a look at MayaVi2, Plotly, and VisPy if you require more powerful 3D plotting functions. If you enjoyed this excerpt, be sure to check out the book it is from.
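One small detail to note about the workaround above: FuncFormatter is used but never imported in this excerpt. In a standalone script you would need the import below (added here for completeness; FuncFormatter lives in Matplotlib's ticker module):

from matplotlib.ticker import FuncFormatter  # the formatter class that wraps major_formatter above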


Visualizing univariate distribution in Seaborn

Sugandha Lahoti
16 Nov 2017
7 min read
[box type="note" align="" class="" width=""]This article is an excerpt from a book by Allen Chi Shing Yu, Claire Yik Lok Chung, and Aldrin Kay Yuen Yim titled Matplotlib 2.x By Example. [/box] Seaborn by Michael Waskom is a statistical visualization library that is built on top of Matplotlib. It comes with handy functions for visualizing categorical variables, univariate distributions, and bivariate distributions. In this article, we will visualize univariate distribution in Seaborn. Visualizing univariate distribution Seaborn makes the task of visualizing the distribution of a dataset much easier. In this example, we are going to use the annual population summary published by the Department of Economic and Social Affairs, United Nations, in 2015. Projected population figures towards 2100 were also included in the dataset. Let's see how it distributes among different countries in 2017 by plotting a bar plot: import seaborn as sns import matplotlib.pyplot as plt # Extract USA population data in 2017 current_population = population_df[(population_df.Location == 'United States of America') & (population_df.Time == 2017) & (population_df.Sex != 'Both')] # Population Bar chart sns.barplot(x="AgeGrp",y="Value", hue="Sex", data = current_population) # Use Matplotlib functions to label axes rotate tick labels ax = plt.gca() ax.set(xlabel="Age Group", ylabel="Population (thousands)") ax.set_xticklabels(ax.xaxis.get_majorticklabels(), rotation=45) plt.title("Population Barchart (USA)") # Show the figure plt.show() Bar chart in Seaborn The seaborn.barplot() function shows a series of data points as rectangular bars. If multiple points per group are available, confidence intervals will be shown on top of the bars to indicate the uncertainty of the point estimates. Like most other Seaborn functions, various input data formats are supported, such as Python lists, Numpy arrays, pandas Series, and pandas DataFrame. A more traditional way to show the population structure is through the use of a population pyramid. So what is a population pyramid? As its name suggests, it is a pyramid-shaped plot that shows the age distribution of a population. It can be roughly classified into three classes, namely constrictive, stationary, and expansive for populations that are undergoing negative, stable, and rapid growth respectively. For instance, constrictive populations have a lower proportion of young people, so the pyramid base appears to be constricted. Stable populations have a more or less similar number of young and middle-aged groups. Expansive populations, on the other hand, have a large proportion of youngsters, thus resulting in pyramids with enlarged bases. 
We can build a population pyramid by plotting two bar charts on two subplots with a shared y axis: import seaborn as sns import matplotlib.pyplot as plt # Extract USA population data in 2017 current_population = population_df[(population_df.Location == 'United States of America') & (population_df.Time == 2017) & (population_df.Sex != 'Both')] # Change the age group to descending order current_population = current_population.iloc[::-1] # Create two subplots with shared y-axis fig, axes = plt.subplots(ncols=2, sharey=True) # Bar chart for male sns.barplot(x="Value",y="AgeGrp", color="darkblue", ax=axes[0], data = current_population[(current_population.Sex == 'Male')]) # Bar chart for female sns.barplot(x="Value",y="AgeGrp", color="darkred", ax=axes[1], data = current_population[(current_population.Sex == 'Female')]) # Use Matplotlib function to invert the first chart axes[0].invert_xaxis() # Use Matplotlib function to show tick labels in the middle axes[0].yaxis.tick_right() # Use Matplotlib functions to label the axes and titles axes[0].set_title("Male") axes[1].set_title("Female") axes[0].set(xlabel="Population (thousands)", ylabel="Age Group") axes[1].set(xlabel="Population (thousands)", ylabel="") fig.suptitle("Population Pyramid (USA)") # Show the figure plt.show() Since Seaborn is built on top of the solid foundations of Matplotlib, we can customize the plot easily using built-in functions of Matplotlib. In the preceding example, we used matplotlib.axes.Axes.invert_xaxis() to flip the male population plot horizontally, followed by changing the location of the tick labels to the right-hand side using matplotlib.axis.YAxis.tick_right(). We further customized the titles and axis labels for the plot using a combination of matplotlib.axes.Axes.set_title(), matplotlib.axes.Axes.set(), and matplotlib.figure.Figure.suptitle(). Let's try to plot the population pyramids for Cambodia and Japan as well by changing the line population_df.Location == 'United States of America' to population_df.Location == 'Cambodia' or  population_df.Location == 'Japan'. Can you classify the pyramids into one of the three population pyramid classes? To see how Seaborn simplifies the code for relatively complex plots, let's see how a similar plot can be achieved using vanilla Matplotlib. 
First, like the previous Seaborn-based example, we create two subplots with shared y axis: fig, axes = plt.subplots(ncols=2, sharey=True) Next, we plot horizontal bar charts using matplotlib.pyplot.barh() and set the location and labels of ticks, followed by adjusting the subplot spacing: # Get a list of tick positions according to the data bins y_pos = range(len(current_population.AgeGrp.unique())) # Horizontal barchart for male axes[0].barh(y_pos, current_population[(current_population.Sex == 'Male')].Value, color="darkblue") # Horizontal barchart for female axes[1].barh(y_pos, current_population[(current_population.Sex == 'Female')].Value, color="darkred") # Show tick for each data point, and label with the age group axes[0].set_yticks(y_pos) axes[0].set_yticklabels(current_population.AgeGrp.unique()) # Increase spacing between subplots to avoid clipping of ytick labels plt.subplots_adjust(wspace=0.3) Finally, we use the same code to further customize the look and feel of the figure: # Invert the first chart axes[0].invert_xaxis() # Show tick labels in the middle axes[0].yaxis.tick_right() # Label the axes and titles axes[0].set_title("Male") axes[1].set_title("Female") axes[0].set(xlabel="Population (thousands)", ylabel="Age Group") axes[1].set(xlabel="Population (thousands)", ylabel="") fig.suptitle("Population Pyramid (USA)") # Show the figure plt.show() When compared to the Seaborn-based code, the pure Matplotlib implementation requires extra lines to define the tick positions, tick labels, and subplot spacing. For some other Seaborn plot types that include extra statistical calculations such as linear regression, and Pearson correlation, the code reduction is even more dramatic. Therefore, Seaborn is a "batteries-included" statistical visualization package that allows users to write less verbose code. Histogram and distribution fitting in Seaborn In the population example, the raw data was already binned into different age groups. What if the data is not binned (for example, the BigMac Index data)? Turns out, seaborn.distplot can help us to process the data into bins and show us a histogram as a result. Let's look at this example: import seaborn as sns import matplotlib.pyplot as plt # Get the BigMac index in 2017 current_bigmac = bigmac_df[(bigmac_df.Date == "2017-01-31")] # Plot the histogram ax = sns.distplot(current_bigmac.dollar_price) plt.show() The seaborn.distplot function expects either pandas Series, single-dimensional numpy.array, or a Python list as input. Then, it determines the size of the bins according to the Freedman-Diaconis rule, and finally it fits a kernel density estimate (KDE) over the histogram. KDE is a non-parametric method used to estimate the distribution of a variable. We can also supply a parametric distribution, such as beta, gamma, or normal distribution, to the fit argument. In this example, we are going to fit the normal distribution from the scipy.stats package over the Big Mac Index dataset: from scipy import stats ax = sns.distplot(current_bigmac.dollar_price, kde=False, fit=stats.norm) plt.show() [INSERT IMAGE] You have now equipped yourself with the knowledge to visualize univariate data in Seaborn as Bar Charts, Histogram, and distribution fitting. To have more fun visualizing data with Seaborn and Matplotlib, check out the book,  this snippet appears from.
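Because bigmac_df is loaded earlier in the book and not shown in this excerpt, here is a minimal, self-contained sketch of the same distplot-with-fit idea using synthetic prices (an illustrative addition, not the book's code; note that newer Seaborn releases have since deprecated distplot in favour of displot/histplot):

import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from scipy import stats

# Synthetic stand-in for the Big Mac dollar prices used in the excerpt
np.random.seed(0)
dollar_price = np.random.normal(loc=3.5, scale=0.8, size=200)

sns.distplot(dollar_price, kde=False, fit=stats.norm)
plt.xlabel("dollar_price")
plt.show()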

Using R to implement Kriging - A Spatial Interpolation technique for Geostatistics data

Guest Contributor
15 Nov 2017
7 min read
The Kriging interpolation technique is being increasingly used in geostatistics these days. But how does Kriging work to create a prediction, after all? To start with, Kriging is a method where the distance and direction between the sample data points indicate a spatial correlation. This correlation is then used to explain the different variations in the surface. In cases where the distance and direction give appropriate spatial correlation, Kriging will be able to predict surface variations in the most effective way. As such, we often see Kriging being used in Geology and Soil Sciences. Kriging generates an optimal output surface for prediction which it estimates based on a scattered set with z-values. The procedure involves investigating the z-values’ spatial behavior in an ‘interactive’ manner where advanced statistical relationships are measured (autocorrelation). Mathematically speaking, Kriging is somewhat similar to regression analysis and its whole idea is to predict the unknown value of a function at a given point by calculating the weighted average of all known functional values in the neighborhood. To get the output value for a location, we take the weighted sum of already measured values in the surrounding (all the points that we intend to consider around a specific radius), using a  formula such as the following: In a regression equation, λi would represent the weights of how far the points are from the prediction location. However, in Kriging, λi represent not just the weights of how far the measured points are from prediction location, but also how the measured points are arranged spatially around the prediction location. First, the variograms and covariance functions are generated to create the spatial autocorrelation of data. Then, that data is used to make predictions. Thus, unlike the deterministic interpolation techniques like Inverse Distance Weighted (IDW) and Spline interpolation tools, Kriging goes beyond just estimating a prediction surface. Here, it brings an element of certainty in that prediction surface. That is why experts rate kriging so highly for a strong prediction. Instead of a weather report forecasting a 2 mm rain on a certain Saturday, Kriging also tells you what is the "probability" of a 2 mm rain on that Saturday. We hope you enjoy this simple R tutorial on Kriging by Berry Boessenkool. 
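The weighted-average formula mentioned above does not survive in this excerpt; in its standard form it is the ordinary kriging estimator, which in LaTeX notation reads \hat{Z}(s_0) = \sum_{i=1}^{N} \lambda_i Z(s_i), where Z(s_i) are the measured values at the N sampled locations, \lambda_i are the kriging weights, and \hat{Z}(s_0) is the prediction at the unmeasured location s_0.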
Geostatistics: Kriging - spatial interpolation between points, using semivariance We will be covering following sections in our tutorial with supported illustrations: Packages read shapefile Variogram Kriging Plotting Kriging: packages install.packages("rgeos") install.packages("sf") install.packages("geoR") library(sf) # for st_read (read shapefiles), # st_centroid, st_area, st_union library(geoR) # as.geodata, variog, variofit, # krige.control, krige.conv, legend.krige ## Warning: package ’sf’ was built under R version 3.4.1 Kriging: read shapefile / few points for demonstration x <- c(1,1,2,2,3,3,3,4,4,5,6,6,6) y <- c(4,7,3,6,2,4,6,2,6,5,1,5,7) z <- c(5,9,2,6,3,5,9,4,8,8,3,6,7) plot(x,y, pch="+", cex=z/4) Kriging: read shapefile II GEODATA <- as.geodata(cbind(x,y,z)) plot(GEODATA) Kriging: Variogram I EMP_VARIOGRAM <- variog(GEODATA) ## variog: computing omnidirectional variogram FIT_VARIOGRAM <- variofit(EMP_VARIOGRAM) ## variofit: covariance model used is matern ## variofit: weights used: npairs ## variofit: minimisation function used: optim ## Warning in variofit(EMP_VARIOGRAM): initial values not provided - running the default search ## variofit: searching for best initial value ... selected values: ## sigmasq phi tausq kappa ## initial.value "9.19" "3.65" "0" "0.5" ## status "est" "est" "est" "fix" ## loss value: 401.578968904954 Kriging: Variogram II plot(EMP_VARIOGRAM) lines(FIT_VARIOGRAM) Kriging: Kriging res <- 0.1 grid <- expand.grid(seq(min(x),max(x),res), seq(min(y),max(y),res)) krico <- krige.control(type.krige="OK", obj.model=FIT_VARIOGRAM) krobj <- krige.conv(GEODATA, locations=grid, krige=krico) ## krige.conv: model with constant mean ## krige.conv: Kriging performed using global neighbourhood # KRigingObjekt Kriging: Plotting I image(krobj, col=rainbow2(100)) legend.krige(col=rainbow2(100), x.leg=c(6.2,6.7), y.leg=c(2,6), vert=T, off=-0.5, values=krobj$predict) contour(krobj, add=T) colPoints(x,y,z, col=rainbow2(100), legend=F) points(x,y) Kriging: Plotting II library("berryFunctions") # scatterpoints by color colPoints(x,y,z, add=F, cex=2, legargs=list(y1=0.8,y2=1)) Kriging: Plotting III colPoints(grid[ ,1], grid[ ,2], krobj$predict, add=F, cex=2, col2=NA, legargs=list(y1=0.8,y2=1)) Time for a real dataset Precipitation from ca 250 gauges in Brandenburg  as Thiessen Polygons with steep gradients at edges: Exercise 41: Kriging Load and plot the shapefile in PrecBrandenburg.zip with sf::st_read. With colPoints in the package berryFunctions, add the precipitation  values at the centroids of the polygons. Calculate the variogram and fit a semivariance curve. Perform kriging on a grid with a useful resolution (keep in mind that computing time rises exponentially  with grid size). Plot the interpolated  values with image or an equivalent (Rclick 4.15) and add contour lines. What went wrong? (if you used the defaults, the result will be dissatisfying.) How can you fix it? 
Solution for exercise 41.1-2: Kriging Data # Shapefile: p <- sf::st_read("data/PrecBrandenburg/niederschlag.shp", quiet=TRUE) # Plot prep pcol <- colorRampPalette(c("red","yellow","blue"))(50) clss <- berryFunctions::classify(p$P1, breaks=50)$index # Plot par(mar = c(0,0,1.2,0)) plot(p, col=pcol[clss], max.plot=1) # P1: Precipitation # kriging coordinates cent <- sf::st_centroid(p) berryFunctions::colPoints(cent$x, cent$y, p$P1, add=T, cex=0.7, legargs=list(y1=0.8,y2=1), col=pcol) points(cent$x, cent$y, cex=0.7) Solution for exercise 41.3: Variogram library(geoR) # Semivariance: geoprec <- as.geodata(cbind(cent$x,cent$y,p$P1)) vario <- variog(geoprec, max.dist=130000) ## variog: computing omnidirectional variogram fit <-variofit(vario) ## Warning in variofit(vario): initial values not provided - running the default search ## variofit: searching for best initial value ... selected values: ## sigmasq phi tausq kappa ## initial.value "1326.72" "19999.93" "0" "0.5" ## status "est" "est" "est" "fix" ## loss value: 107266266.76371 plot(vario) ; lines(fit) # distance to closest other point: d <- sapply(1:nrow(cent), function(i) min(berryFunctions::distance( cent$x[i], cent$y[i], cent$x[-i], cent$y[-i]))) hist(d/1000, breaks=20, main="distance to closest gauge [km]") mean(d) # 8 km ## [1] 8165.633 Solution for exercise 41.4-5: Kriging # Kriging: res <- 1000 # 1 km, since stations are 8 km apart on average grid <- expand.grid(seq(min(cent$x),max(cent$x),res), seq(min(cent$y),max(cent$y),res)) krico <- krige.control(type.krige="OK", obj.model=fit) krobj <- krige.conv(geoprec, locations=grid, krige=krico) ## krige.conv: model with constant mean ## krige.conv: Kriging performed using global neighbourhood # Set values outside of Brandenburg to NA: grid_sf <- sf::st_as_sf(grid, coords=1:2, crs=sf::st_crs(p)) isinp <- sapply(sf::st_within(grid_sf, p), length) > 0 krobj2 <- krobj krobj2$predict[!isinp] <- NA Solution for exercise 41.5: Kriging Visualization geoR:::image.kriging(krobj2, col=pcol) colPoints(cent$x, cent$y, p$P1, col=pcol, zlab="Prec", cex=0.7, legargs=list(y1=0.1,y2=0.8, x1=0.78, x2=0.87, horiz=F)) plot(p, add=T, col=NA, border=8)#; points(cent$x,cent$y, cex=0.7) [author title="About the author"]Berry started working with R in 2010 during his studies of Geoecology at Potsdam University, Germany. He has since then given a number of R programming workshops and tutorials, including full-week workshops in Kyrgyzstan and Kazachstan. He has left the department for environmental science in summer 2017 to focus more on software development and teaching in the data science industry. Please follow the Github link for detailed explanations on Berry’s R courses. [/author]


Implementing K-Means Clustering in Python

Aaron Lazar
09 Nov 2017
9 min read
This article is an adaptation of content from the book Data Science Algorithms in a Week, by David Natingga. I’ve modified it a bit and made turned it into a sequence from a thriller, starring Agents Hobbs and O’Connor, from the FBI. The idea is to practically show you how to implement a k-means cluster in your friendly neighborhood language, Python. Agent Hobbs: Agent… Agent O’Connor… O’Connor! Agent O’Connor: Blimey! Uh.. Ohh.. Sorry, sir! Hobbs: ‘Tat’s abou’ the fifth time oive’ caught you sleeping on duty, young man! O’Connor: Umm. Apologies, sir. I just arrived here, and didn’t have much to… Hobbs: Cut the bull, agent! There’s an important case oime workin’ on and oi’ need information on this righ’ awai’! Here’s the list of missing persons kidnapped so far by the suspects. The suspects now taunt us with open clues abou’ their next target! Based on the information, we’ve narrowed their target list down to Miss. Gibbons and Mr. Hudson. Hobbs throws a file across O’Connor’s desk. Hobbs says as he storms out the door: You ‘ave an hour to find out who needs the special security, so better get working. O’Connor: Yes, sir! Bloody hell, that was close! Here’s the information O’Connor has: He needs to find what the probability is, that the 11th person with a height of 172cm, weight of 60kg, and with long hair is a man. O’Connor gets to work. To simplify matters, he removes the column Hair length as well as the column Gender, since he would like to cluster the people in the table based on their height and weight. To find out whether the 11th person in the table is more likely to be a man or a woman, he uses Clustering: Analysis O’Connor may apply scaling to the initial data, but to simplify the matters, he uses the unscaled data in the algorithm. He clusters the data into the two clusters since there are two possibilities for genders – a male or a female. Then he aims to classify a person with the height 172cm and weight 60kg to be more likely a man if and only if there are more men in that cluster. The clustering algorithm is a very efficient technique. Thus classifying this way is very fast, especially if there are a large number of the features to classify. So he goes on to apply the k-means clustering algorithm to the data he has. First, he picks up the initial centroids. He assumes the first centroid be, for example, a person with the height 180cm and the weight 75kg denoted in a vector as (180,75). Then the point that is furthest away from (180,75) is (155,46). So that will be the second centroid. The points that are closer to the first centroid (180,75) by taking Euclidean distance are (180,75), (174,71), (184,83), (168,63), (178,70), (170,59), (172,60). So these points will be in the first cluster. The points that are closer to the second centroid (155,46) are (155,46), (164,53), (162,52), (166,55). So these points will be in the second cluster. He displays the current situation of these two clusters in the image as below. Clustering of people by their height and weight He then recomputes the centroids of the clusters. The blue cluster with the features (180,75), (174,71), (184,83), (168,63), (178,70), (170,59), (172,60) will have the centroid ((180+174+184+168+178+170+172)/7 (75+71+83+63+70+59+60)/7)~(175.14,68.71). The red cluster with the features (155,46), (164,53), (162,52), (166,55) will have the centroid ((155+164+162+166)/4, (46+53+52+55)/4) = (161.75, 51.5). Reclassifying the points using the new centroid, the classes of the points do not change. 
The blue cluster will have the points (180,75), (174,71), (184,83), (168,63), (178,70), (170,59), (172,60). The red cluster will have the points (155,46), (164,53), (162,52), (166,55). Therefore the clustering algorithm terminates with clusters as displayed in the following image: Clustering of people by their height and weight Now he classifies the instance (172,60) as to whether it is a male or a female. The instance (172,60) is in the blue cluster. So it is similar to the features in the blue cluster. Are the remaining features in the blue cluster more likely males or females? 5 out of 6 features are males, only 1 is a female. Since the majority of the features are males in the blue cluster and the person (172,60) is in the blue cluster as well, he classifies the person with the height 172cm and the weight 60kg as a male. Implementing K-Means clustering in Python O’Connor implements the k-means clustering algorithm in Python. It takes as an input a CSV file with one data item per line. A data item is converted to a point. The algorithm classifies these points into the specified number of clusters. In the end, the clusters are visualized on the graph using the matplotlib library: # source_code/5/k-means_clustering.py import math import imp import sys import matplotlib.pyplot as plt import matplotlib import sys sys.path.append('../common') import common # noqa matplotlib.style.use('ggplot') # Returns k initial centroids for the given points. def choose_init_centroids(points, k): centroids = [] centroids.append(points[0]) while len(centroids) < k: # Find the centroid that with the greatest possible distance # to the closest already chosen centroid. candidate = points[0] candidate_dist = min_dist(points[0], centroids) for point in points: dist = min_dist(point, centroids) if dist > candidate_dist: candidate = point candidate_dist = dist centroids.append(candidate) return centroids # Returns the distance of a point from the closest point in points. def min_dist(point, points): min_dist = euclidean_dist(point, points[0]) for point2 in points: dist = euclidean_dist(point, point2) if dist < min_dist: min_dist = dist return min_dist # Returns an Euclidean distance of two 2-dimensional points. def euclidean_dist((x1, y1), (x2, y2)): return math.sqrt((x1 - x2) * (x1 - x2) + (y1 - y2) * (y1 - y2)) # PointGroup is a tuple that contains in the first coordinate a 2d point # and in the second coordinate a group which a point is classified to. def choose_centroids(point_groups, k): centroid_xs = [0] * k centroid_ys = [0] * k group_counts = [0] * k for ((x, y), group) in point_groups: centroid_xs[group] += x centroid_ys[group] += y group_counts[group] += 1 centroids = [] for group in range(0, k): centroids.append(( float(centroid_xs[group]) / group_counts[group], float(centroid_ys[group]) / group_counts[group])) return centroids # Returns the number of the centroid which is closest to the point. # This number of the centroid is the number of the group where # the point belongs to. def closest_group(point, centroids): selected_group = 0 selected_dist = euclidean_dist(point, centroids[0]) for i in range(1, len(centroids)): dist = euclidean_dist(point, centroids[i]) if dist < selected_dist: selected_group = i selected_dist = dist return selected_group # Reassigns the groups to the points according to which centroid # a point is closest to. 
def assign_groups(point_groups, centroids): new_point_groups = [] for (point, group) in point_groups: new_point_groups.append( (point, closest_group(point, centroids))) return new_point_groups # Returns a list of pointgroups given a list of points. def points_to_point_groups(points): point_groups = [] for point in points: point_groups.append((point, 0)) return point_groups # Clusters points into the k groups adding every stage # of the algorithm to the history which is returned. def cluster_with_history(points, k): history = [] centroids = choose_init_centroids(points, k) point_groups = points_to_point_groups(points) while True: point_groups = assign_groups(point_groups, centroids) history.append((point_groups, centroids)) new_centroids = choose_centroids(point_groups, k) done = True for i in range(0, len(centroids)): if centroids[i] != new_centroids[i]: done = False Break if done: return history centroids = new_centroids # Program start csv_file = sys.argv[1] k = int(sys.argv[2]) everything = False # The third argument sys.argv[3] represents the number of the step of the # algorithm starting from 0 to be shown or "last" for displaying the last # step and the number of the steps. if sys.argv[3] == "last": everything = True Else: step = int(sys.argv[3]) data = common.csv_file_to_list(csv_file) points = data_to_points(data) # Represent every data item by a point. history = cluster_with_history(points, k) if everything: print "The total number of steps:", len(history) print "The history of the algorithm:" (point_groups, centroids) = history[len(history) - 1] # Print all the history. print_cluster_history(history) # But display the situation graphically at the last step only. draw(point_groups, centroids) else: (point_groups, centroids) = history[step] print "Data for the step number", step, ":" print point_groups, centroids draw(point_groups, centroids) Input data from gender classification He saves data from the classification into the CSV file: # source_code/5/persons_by_height_and_weight.csv 180,75 174,71 184,83 168,63 178,70 170,59 164,53 155,46 162,52 166,55 172,60 Program output for the classification data O’Connor runs the program implementing k-means clustering algorithm on the data from the classification. The numerical argument 2 means that he would like to cluster the data into 2 clusters: $ python k-means_clustering.py persons_by_height_weight.csv 2 last The total number of steps: 2 The history of the algorithm: Step number 0: point_groups = [((180.0, 75.0), 0), ((174.0, 71.0), 0), ((184.0, 83.0), 0), ((168.0, 63.0), 0), ((178.0, 70.0), 0), ((170.0, 59.0), 0), ((164.0, 53.0), 1), ((155.0, 46.0), 1), ((162.0, 52.0), 1), ((166.0, 55.0), 1), ((172.0, 60.0), 0)] centroids = [(180.0, 75.0), (155.0, 46.0)] Step number 1: point_groups = [((180.0, 75.0), 0), ((174.0, 71.0), 0), ((184.0, 83.0), 0), ((168.0, 63.0), 0), ((178.0, 70.0), 0), ((170.0, 59.0), 0), ((164.0, 53.0), 1), ((155.0, 46.0), 1), ((162.0, 52.0), 1), ((166.0, 55.0), 1), ((172.0, 60.0), 0)] centroids = [(175.14285714285714, 68.71428571428571), (161.75, 51.5)] The program also outputs a graph visible in the 2nd image. The parameter last means that O’Connor would like the program to do the clustering until the last step. If he wants to display only the first step (step 0), he can change last to 0 to run: $ python k-means_clustering.py persons_by_height_weight.csv 2 0 Upon the execution of the program, O’Connor gets the graph of the clusters and their centroids at the initial step, as in image 1. He heaves a sigh of relief. 
Hobbs returns just then: Oye there O’Connor, not snoozing again now O’are ya? O’Connor: Not at all, sir. I think we need to provide Mr. Hudson with special protection because it looks like he’s the next target. Hobbs raises an eyebrow as he adjusts his gun in its holster: Emm, O’are ya sure, agent? O’Connor replies with a smile: 83.33% confident, sir! Hobbs: Wha’ are we waiting for then, eh? Let’s go get ’em! If you liked reading this mystery, go ahead and buy the book it was inspired by: Data Science Algorithms in a Week, by David Natingga.
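For readers who prefer a library implementation to the step-by-step version above, the same clustering can be sketched with scikit-learn (an addition to the excerpt, assuming scikit-learn is installed; scikit-learn picks its own starting centroids with k-means++, so the cluster numbering and exact centroids may differ slightly from the hand-worked ones, but the two groups should come out the same for this data):

import numpy as np
from sklearn.cluster import KMeans

# Height (cm) and weight (kg) of the 11 people from the table above
points = np.array([[180, 75], [174, 71], [184, 83], [168, 63], [178, 70],
                   [170, 59], [164, 53], [155, 46], [162, 52], [166, 55],
                   [172, 60]])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(points)
print(kmeans.labels_)           # cluster assignment of each person
print(kmeans.cluster_centers_)  # compare with the centroids derived above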


How to Integrate Keras and TensorFlow with R

Amey Varangaonkar
08 Nov 2017
6 min read
[box type="info" align="" class="" width=""]The following is an excerpt from the book Neural Networks with R, Chapter 7, Use Cases of Neural Networks - Advanced Topics, written by Giuseppe Ciaburro and Balaji Venkateswaran. In this post, we see how to integrate popular deep learning libraries and frameworks like TensorFlow with R for effective neural network modeling.[/box] TensorFlow is an open source numerical computing library provided by Google for machine intelligence. It hides all of the programming required to build deep learning models and gives the developers a black box interface to program. The Keras API for TensorFlow provides a high-level interface for neural networks. Python is the de facto programming language for deep learning, but R is catching up. Deep learning libraries are now available with R and a developer can easily download TensorFlow or Keras similar to other R libraries and use them. In TensorFlow, nodes in the graph represent mathematical operations, while the graph edges represent the multidimensional data arrays (tensors) communicated between them. TensorFlow was originally developed by the Google Brain Team within Google's machine intelligence research for machine learning and deep neural networks research, but it is now available in the public domain. TensorFlow exploits GPU processing when configured appropriately. The generic use cases for TensorFlow are as follows: Image recognition Computer vision Voice/sound recognition Time series analysis Language detection Language translation Text-based processing Handwriting Recognition (HWR) Many others Integrating Tensorflow with R In this section, we will see how we can bring TensorFlow libraries into R. This will open up a huge number of possibilities with deep learning using TensorFlow with R. In order to use TensorFlow, we must first install Python. If you don't have a Python installation on your machine, it's time to get it. Python is a dynamic Object-Oriented Programming (OOP) language that can be used for many types of software development. It offers strong support for integration with other languages and programs, is provided with a large standard library, and can be learned within a few days. Many Python programmers can confirm a substantial increase in productivity and feel that it encourages the development of higher quality code and maintainability. Python runs on Windows, Linux/Unix, macOS X, OS/2, Amiga, Palm Handhelds, and Nokia phones. It also works on Java and .NET virtual machines. Python is licensed under the OSI-approved open source license; its use is free, including for commercial products. [box type="shadow" align="" class="" width=""]If you do not know which version to use, there is a document that could help you choose. In principle, if you have to start from scratch, we recommend choosing Python 3, and if you need to use third-party software packages that may not be compatible with Python 3, we recommend using Python 2.7. All information about the available versions and how to install Python is given at https://www.python.org/[/box] After properly installing the Python version of our machine, we have to worry about installing TensorFlow. We can retrieve all library information and available versions of the operating system from the following link: https://www.tensorflow.org/ . Also, in the install section, we can find a series of guides that explain how to install a version of TensorFlow that allows us to write applications in Python. 
Guides are available for the following operating systems: Installing TensorFlow on Ubuntu Installing TensorFlow on macOS X Installing TensorFlow on Windows Installing TensorFlow from sources For example, to install Tensorflow on Windows, we must choose one of the following types: TensorFlow with CPU support only TensorFlow with GPU support To install TensorFlow, start a terminal with privileges as administrator. Then issue the appropriate pip3 install command in that terminal. To install the CPU-only version, enter the following command: C:> pip3 install --upgrade tensorflow A series of code lines will be displayed on the video to keep us informed of the execution of the installation procedure, as shown in the following figure: Note: The output screenshot has been cropped for clarity purposes At this point, we can return to our favorite environment; I am referring to the R development environment. We will need to install the interface to TensorFlow. The R interface to TensorFlow lets you work productively using the high-level Keras and Estimator APIs, and when you need more control, it provides full access to the core TensorFlow API. To install the R interface to TensorFlow, follow the steps below First, install the tensorflow R package from CRAN as follows: install.packages("tensorflow")  2. Then, use the install_tensorflow() function to install TensorFlow (for a proper installation procedure, you must have administrator privileges): library(tensorflow) install_tensorflow()  3. We can confirm that the installation succeeded: sess = tf$Session() hello <- tf$constant('Hello, TensorFlow!') sess$run(hello) This will provide you with a default installation of TensorFlow suitable for use with the tensorflow R package. Read on if you want to learn about additional installation options.  4. If you want to install a version of TensorFlow that takes advantage of NVIDIA GPUs, you need to have the correct CUDA libraries installed. In the following code, we can check the success of the installation: library(tensorflow) > sess = tf$Session() > hello <- tf$constant('Hello, TensorFlow!') > sess$run(hello) b'Hello, TensorFlow!' Integrating Keras with R Keras is a set of open source neural network libraries coded in Python. It is capable of running on top of MxNet, TensorFlow, or Theano. The steps to install Keras in RStudio are very simple. The following code snippet gives the steps for installation and we can check whether Keras is working by checking the load of the MNIST dataset. By default, RStudio loads the CPU version of TensorFlow. Once Keras is loaded, we have a powerful set of deep learning libraries that can be utilized by R programmers to execute neural networks and deep learning. To install Keras for R, follow these steps: Run the following code install.packages("devtools") devtools::install_github("rstudio/keras")  2. At this point, we load the keras library: library(keras)  3. Finally, we check whether keras is installed correctly by loading the MNIST dataset: > data=dataset_mnist()   If you found this excerpt useful, make sure you check out the book Neural Networks with R, containing an interesting coverage of many such useful and insightful topics.
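The dataset_mnist() call in step 3 is a thin wrapper around the Keras dataset loader of the underlying Python installation. If you ever need to sanity-check that Python/Keras installation directly, an equivalent check looks roughly like this (an illustrative addition, not part of the book's R code; it assumes a recent TensorFlow with the bundled Keras API):

from tensorflow.keras.datasets import mnist

(x_train, y_train), (x_test, y_test) = mnist.load_data()
print(x_train.shape, x_test.shape)  # (60000, 28, 28) (10000, 28, 28)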

Implementing Object detection with Go using TensorFlow

Kunal Parikh
07 Nov 2017
10 min read
[box type="note" align="" class="" width=""]The following is an excerpt from the book Machine Learning with Go, Chapter 8, Neural Networks and Deep Learning, written by Daniel Whitenack. The associated code bundle is available at the end of the article.[/box] Deep learning models are powerful! Especially for tasks like computer vision. However, you should also keep in mind that complicated combinations of these neural net components are also extremely hard to interpret. That is, determining why the model made a certain prediction can be near impossible. This can be a problem when you need to maintain compliance in certain industries and jurisdictions, and it also might inhibit debugging or maintenance of your applications. That being said, there are some major efforts to improve the interpretability of deep learning models. Notable among these efforts is the LIME project: Deep learning with Go There are a variety of options when you are looking to build or utilize deep learning models from Go. This, as with deep learning itself, is an ever-changing landscape. However, the options for building, training and utilizing deep learning models in Go are generally as follows: Use a Go package: There are Go packages that allow you to use Go as your main interface to build and train deep learning models. The most features and developed of these packages is Gorgonia. It treats Go as a first-class citizen and is written in Go, even if it does make significant usage of cgo to interface with numerical libraries. Use an API or Go client for a non-Go DL framework: You can interface with popular deep learning services and frameworks from Go including TensorFlow, MachineBox, H2O, and the various cloud providers or third-party API offerings (such as IBM Watson). TensorFlow and Machine Box actually have Go bindings or SDKs, which are continually improving. For the other services, you may need to interact via REST or even call binaries using exec. Use cgo: Of course, Go can talk to and integrate with C/C++ libraries for deep learning, including the TensorFlow libraries and various libraries from Intel. However, this is a difficult road, and it is only recommended when absolutely necessary. As TensorFlow is by far the most popular framework for deep learning (at the moment), we will briefly explore the second category listed here. However, the Tensorflow Go bindings are under active development and some functionality is quite crude at the moment. The TensorFlow team recommends that if you are going to use a TensorFlow model in Go, you first train and export this model using Python. That pre-trained model can then be utilized from Go, as we will demonstrate in the next section. There are a number of members of the community working very hard to make Go more of a first-class citizen for TensorFlow. As such, it is likely that the rough edges of the TensorFlow bindings will be smoothed over the coming year. Setting up TensorFlow for use with Go The TensorFlow team has provided some good docs to install TensorFlow and get it ready for usage with Go. These docs can be found here. 
There are a couple of preliminary steps, but once you have the TensorFlow C libraries installed, you can get the following Go package: $ go get github.com/tensorflow/tensorflow/tensorflow/go Everything should be good to go if you were able to get github.com/tensorflow/tensorflow/tensorflow/go without error, but you can make sure that you are ready to use TensorFlow by executing the following tests: $ go test github.com/tensorflow/tensorflow/tensorflow/go ok github.com/tensorflow/tensorflow/tensorflow/go 0.045s Retrieving and calling a pretrained TensorFlow model The model that we are going to use is a Google model for object recognition in images called Inception. The model can be retrieved as follows: $ mkdir model $ cd model $ wget https://storage.googleapis.com/download.tensorflow.org/models/inception5h.z ip --2017-09-09 18:29:03-- https://storage.googleapis.com/download.tensorflow.org/models/inception5h.z ip Resolving storage.googleapis.com (storage.googleapis.com)... 172.217.6.112, 2607:f8b0:4009:812::2010 Connecting to storage.googleapis.com (storage.googleapis.com)|172.217.6.112|:443... connected. HTTP request sent, awaiting response... 200 OK Length: 49937555 (48M) [application/zip] Saving to: ‘inception5h.zip’ inception5h.zip 100%[====================================================================== ===================================================>] 47.62M 19.0MB/s in 2.5s 2017-09-09 18:29:06 (19.0 MB/s) - ‘inception5h.zip’ saved [49937555/49937555] $ unzip inception5h.zip Archive: inception5h.zip inflating: imagenet_comp_graph_label_strings.txt inflating: tensorflow_inception_graph.pb inflating: LICENSE After unzipping the compressed model, you should see a *.pb file. This is a protobuf file that represents a frozen state of the model. Think back to our simple neural network. The network was fully defined by a series of weights and biases. Although more complicated, this model can be defined in a similar way and these definitions are stored in this protobuf file. To call this model, we will use some example code from the TensorFlow Go bindings docs--. This code loads the model and uses the model to detect and label the contents of a *.jpg image. As the code is included in the TensorFlow docs, I will spare the details and just highlight a couple of snippets. To load the model, we perform the following: // Load the serialized GraphDef from a file. modelfile, labelsfile, err := modelFiles(*modeldir) if err != nil { log.Fatal(err) } model, err := ioutil.ReadFile(modelfile) if err != nil { log.Fatal(err) } Then we load the graph definition of the deep learning model and create a new TensorFlow session with the graph, as shown in the following code: // Construct an in-memory graph from the serialized form. graph := tf.NewGraph() if err := graph.Import(model, ""); err != nil { log.Fatal(err) } // Create a session for inference over graph. session, err := tf.NewSession(graph, nil) if err != nil { log.Fatal(err) } defer session.Close() Finally, we can make an inference using the model as follows: // Run inference on *imageFile. // For multiple images, session.Run() can be called in a loop (and concurrently). Alternatively, images can be batched since the model // accepts batches of image data as input. 
tensor, err := makeTensorFromImage(*imagefile) if err != nil { log.Fatal(err) } output, err := session.Run( map[tf.Output]*tf.Tensor{ graph.Operation("input").Output(0): tensor, }, []tf.Output{ graph.Operation("output").Output(0), }, nil) if err != nil { log.Fatal(err) } // output[0].Value() is a vector containing probabilities of // labels for each image in the "batch". The batch size was 1. // Find the most probable label index. probabilities := output[0].Value().([][]float32)[0] printBestLabel(probabilities, labelsfile) Object detection with Go using TensorFlow The Go program for object detection, as specified in the TensorFlow GoDocs, can be called as follows: $ ./myprogram -dir=<path/to/the/model/dir> -image=<path/to/a/jpg/image> When the program is called, it will utilize the pretrained and loaded model to infer the contents of the specified image. It will then output the most likely contents of that image along with its calculated probability. To illustrate this, let's try performing the object detection on the following image of an airplane, saved as airplane.jpg: Running the TensorFlow model from Go gives the following results: $ go build $ ./myprogram -dir=model -image=airplane.jpg 2017-09-09 20:17:30.655757: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE4.1 instructions, but these are available on your machine and could speed up CPU computations. 2017-09-09 20:17:30.655807: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE4.2 instructions, but these are available on your machine and could speed up CPU computations. 2017-09-09 20:17:30.655814: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use AVX instructions, but these are available on your machine and could speed up CPU computations. 2017-09-09 20:17:30.655818: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use AVX2 instructions, but these are available on your machine and could speed up CPU computations. 2017-09-09 20:17:30.655822: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use FMA instructions, but these are available on your machine and could speed up CPU computations. BEST MATCH: (86% likely) airliner After some suggestions about speeding up CPU computations, we get a result: airliner. Wow! That's pretty cool. We just performed object recognition with TensorFlow right from our Go program! Let try another one. This time, we will use pug.jpg, which looks like the following: Running our program again with this image gives the following: $ ./myprogram -dir=model -image=pug.jpg 2017-09-09 20:20:32.323855: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE4.1 instructions, but these are available on your machine and could speed up CPU computations. 2017-09-09 20:20:32.323896: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE4.2 instructions, but these are available on your machine and could speed up CPU computations. 2017-09-09 20:20:32.323902: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use AVX instructions, but these are available on your machine and could speed up CPU computations. 
Object detection with Go using TensorFlow

The Go program for object detection, as specified in the TensorFlow GoDocs, can be called as follows:

$ ./myprogram -dir=<path/to/the/model/dir> -image=<path/to/a/jpg/image>

When the program is called, it will utilize the pretrained, loaded model to infer the contents of the specified image. It will then output the most likely contents of that image along with its calculated probability. To illustrate this, let's try performing object detection on the following image of an airplane, saved as airplane.jpg:

Running the TensorFlow model from Go gives the following results:

$ go build
$ ./myprogram -dir=model -image=airplane.jpg
2017-09-09 20:17:30.655757: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE4.1 instructions, but these are available on your machine and could speed up CPU computations.
2017-09-09 20:17:30.655807: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE4.2 instructions, but these are available on your machine and could speed up CPU computations.
2017-09-09 20:17:30.655814: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use AVX instructions, but these are available on your machine and could speed up CPU computations.
2017-09-09 20:17:30.655818: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use AVX2 instructions, but these are available on your machine and could speed up CPU computations.
2017-09-09 20:17:30.655822: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use FMA instructions, but these are available on your machine and could speed up CPU computations.
BEST MATCH: (86% likely) airliner

After some warnings suggesting ways to speed up CPU computations, we get a result: airliner. Wow! That's pretty cool. We just performed object recognition with TensorFlow right from our Go program!

Let's try another one. This time, we will use pug.jpg, which looks like the following:

Running our program again with this image gives the following:

$ ./myprogram -dir=model -image=pug.jpg
2017-09-09 20:20:32.323855: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE4.1 instructions, but these are available on your machine and could speed up CPU computations.
2017-09-09 20:20:32.323896: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE4.2 instructions, but these are available on your machine and could speed up CPU computations.
2017-09-09 20:20:32.323902: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use AVX instructions, but these are available on your machine and could speed up CPU computations.
2017-09-09 20:20:32.323906: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use AVX2 instructions, but these are available on your machine and could speed up CPU computations.
2017-09-09 20:20:32.323911: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use FMA instructions, but these are available on your machine and could speed up CPU computations.
BEST MATCH: (84% likely) pug

Success! Not only did the model detect that there was a dog in the picture, it correctly identified that the dog was a pug.

Let's try just one more. As this is a Go article, we cannot resist trying gopher.jpg, which looks like the following (huge thanks to Renee French, the artist behind the Go gopher):

Running the model gives the following result:

$ ./myprogram -dir=model -image=gopher.jpg
2017-09-09 20:25:57.967753: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE4.1 instructions, but these are available on your machine and could speed up CPU computations.
2017-09-09 20:25:57.967801: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE4.2 instructions, but these are available on your machine and could speed up CPU computations.
2017-09-09 20:25:57.967808: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use AVX instructions, but these are available on your machine and could speed up CPU computations.
2017-09-09 20:25:57.967812: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use AVX2 instructions, but these are available on your machine and could speed up CPU computations.
2017-09-09 20:25:57.967817: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use FMA instructions, but these are available on your machine and could speed up CPU computations.
BEST MATCH: (12% likely) safety pin

Well, I guess we can't win them all. It looks like we need to retrain our model so that it can recognize Go gophers. More specifically, we should probably add a bunch of Go gophers to our training dataset, because a Go gopher is definitely not a safety pin!

[box type="download" align="" class="" width=""]The code for this exercise is available here.[/box]

Summary

Congratulations! We have gone from parsing data with Go to calling deep learning models from Go. You now know the basics of neural networks and can implement and utilize them in your Go programs. In the next chapter, we will discuss how to get these models and applications off of your laptops and run them at production scale in data pipelines.

If you enjoyed the above excerpt from the book Machine Learning with Go, check out the book to learn how to build machine learning apps with Go.

Ensemble Methods to Optimize Machine Learning Models

Guest Contributor
07 Nov 2017
8 min read
[box type="info" align="" class="" width=""]We are happy to bring you an elegant guest post on ensemble methods by Benjamin Rogojan, popularly known as The Seattle Data Guy.[/box]

How do data scientists improve their algorithm's accuracy or the robustness of a model? A tried and tested method is ensemble learning. It is a must-know topic if you claim to be a data scientist and/or a machine learning engineer, especially if you are planning to go in for a data science or machine learning interview.

Essentially, ensemble learning stays true to the meaning of the word 'ensemble'. Rather than several voices singing at different octaves to create one beautiful harmony (each voice filling in the void of the others), ensemble learning uses hundreds to thousands of models of the same algorithm that work together to find the correct classification. Another way to think about ensemble learning is the fable of the blind men and the elephant. Each blind man in the story seeks to identify the elephant in front of him, but each works separately and comes up with his own conclusion about the animal. Had they worked in unison, they might eventually have figured out what they were looking at. Similarly, ensemble learning combines the workings of different models for a successful and optimal classification.

Ensemble methods such as boosting and bagging have led to more robust statistical models with decreased variance. Before we explain the various ensemble methods, let us have a glance at the common bond between them: bootstrapping.

Bootstrap: The common glue

An explanation of bootstrapping is occasionally skipped over by data scientists. However, an understanding of bootstrapping is essential, as both of the ensemble methods covered here, boosting and bagging, are based on the concept of bootstrapping.

Figure 1: Bootstrapping

In machine learning terms, the bootstrap method refers to random sampling with replacement; the result of this sampling is referred to as a resample. Working with resamples allows a model or algorithm to get a better understanding of the biases, variances, and features that exist in the data. Because sampling is done with replacement, each resample can have characteristics that differ from the original sample as a whole. This, in turn, affects the overall mean, standard deviation, and other descriptive metrics of the data set, and ultimately leads to the development of more robust models. The diagram above depicts each sample population having different and non-identical pieces.

Bootstrapping is also great for small data sets that may have a tendency to overfit. In fact, we recommended it to one company that was concerned because its data sets were far from "Big Data". Bootstrapping can be a solution in this case because algorithms that utilize bootstrapping are more robust and can handle new data sets, depending on the methodology chosen (boosting or bagging).

The bootstrap method can also test the stability of a solution. By using multiple sample data sets and then testing multiple models, it can increase robustness. In certain cases, one sample data set may have a larger mean than another, or a different standard deviation. This might break a model that was overfitted and never tested on data sets with different variations. One of the many reasons bootstrapping has become so common is the increase in computing power, which allows many permutations to be run on different resamples.
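To make the resampling idea concrete, here is a minimal sketch of bootstrap resampling, written in Go to match the code earlier in this collection. The data set, seed, and number of resamples are made up purely for illustration:

package main

import (
    "fmt"
    "math/rand"
)

// bootstrapResample draws len(data) points from data uniformly at random
// *with replacement*, so individual points can appear more than once and
// others not at all -- each resample has its own mean and spread.
func bootstrapResample(data []float64, rng *rand.Rand) []float64 {
    resample := make([]float64, len(data))
    for i := range resample {
        resample[i] = data[rng.Intn(len(data))]
    }
    return resample
}

// mean returns the arithmetic mean of xs.
func mean(xs []float64) float64 {
    var sum float64
    for _, x := range xs {
        sum += x
    }
    return sum / float64(len(xs))
}

func main() {
    // A tiny, made-up data set purely for illustration.
    data := []float64{2.1, 3.5, 3.9, 4.2, 5.0, 5.6, 6.3, 7.8}
    rng := rand.New(rand.NewSource(42))

    // Each resample is what an individual model in a bagged or boosted
    // ensemble would be trained on.
    for i := 0; i < 3; i++ {
        r := bootstrapResample(data, rng)
        fmt.Printf("resample %d: %v (mean %.2f)\n", i+1, r, mean(r))
    }
}

Each resample would then be handed to one member of the ensemble for training, which is exactly where bagging and boosting pick up.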
Let us now move on to the two most prominent ensemble methods: bagging and boosting.

Ensemble Method 1: Bagging

Bagging actually refers to Bootstrap Aggregating. Most papers or posts that explain bagging algorithms refer to Leo Breiman's 1996 paper, "Bagging Predictors". In the paper, Breiman describes bagging as:

"Bagging predictors is a method for generating multiple versions of a predictor and using these to get an aggregated predictor."

Bagging helps reduce variance in models that are accurate only on the data they were trained on. This problem is also known as overfitting. Overfitting happens when a function fits the data too well, typically because the fitted equation has been made complicated enough to account for every data point and outlier.

Figure 2: Overfitting

Another example of an algorithm that can easily overfit is a decision tree. Models developed using decision trees rely on very simple heuristics: a decision tree is composed of a set of if-else statements applied in a specific order. Thus, if the data set is replaced with a new data set that has some bias or a different spread in its underlying features, the model will fail to be as accurate as before, because the new data will not fit the model well.

Bagging gets around the overfitting problem by creating its own variance in the data. It does this by sampling with replacement while it tests multiple hypotheses (models). In turn, this reduces the noise by utilizing multiple resamples, each of which is likely to have different characteristics (median, average, and so on). Once each model has developed a hypothesis, the models use voting for classification or averaging for regression. This is where the "Aggregating" of "Bootstrap Aggregating" comes into play. As shown in the figure below, each hypothesis has the same weight as all the others. (When we discuss boosting, this is one of the places where the two methodologies differ.)

Figure 3: Bagging

Essentially, all of these models run at the same time and vote on which hypothesis is the most accurate. This helps to decrease variance, that is, to reduce the overfit.

Ensemble Method 2: Boosting

Boosting refers to a group of algorithms that utilize weighted averages to turn weak learners into stronger learners. Unlike bagging, which has each model run independently and then aggregates the outputs at the end without preference for any model, boosting is all about "teamwork". Each model that runs dictates which features the next model will focus on.

Boosting also requires bootstrapping. However, there is another difference here: unlike bagging, boosting weights each sample of data, which means some samples will be run more often than others. Why put weights on the samples of data?

Figure 4: Boosting

When boosting runs each model, it tracks which data samples are the most successful and which are not. The data sets with the most misclassified outputs are given heavier weights, because such data sets are considered to have more complexity and thus require more iterations to properly train the model. During the actual classification stage, boosting also tracks each model's error rate to ensure that better models are given better weights. That way, when the "voting" occurs, as in bagging, the models with better outcomes have a stronger pull on the final output.
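Before weighing the two methods against each other, the following sketch (again in Go, with made-up predictions and weights) contrasts the two aggregation schemes just described: a plain majority vote, in which every bagged model counts equally, and a weighted vote, in which models with lower error rates, as in boosting, pull harder on the final output:

package main

import "fmt"

// majorityVote implements bagging-style aggregation: every model's
// predicted class label counts equally.
func majorityVote(predictions []int) int {
    counts := map[int]int{}
    for _, p := range predictions {
        counts[p]++
    }
    best, bestCount := 0, -1
    for label, c := range counts {
        if c > bestCount {
            best, bestCount = label, c
        }
    }
    return best
}

// weightedVote implements boosting-style aggregation: each model's vote is
// scaled by a weight reflecting how well it performed, so better models
// have a stronger pull on the final output.
func weightedVote(predictions []int, weights []float64) int {
    scores := map[int]float64{}
    for i, p := range predictions {
        scores[p] += weights[i]
    }
    best, bestScore := 0, -1.0
    for label, s := range scores {
        if s > bestScore {
            best, bestScore = label, s
        }
    }
    return best
}

func main() {
    // Made-up predictions from five models for a single example
    // (class 0 vs class 1), purely for illustration.
    predictions := []int{1, 0, 0, 1, 1}

    // Made-up model weights; a higher weight stands in for a lower error rate.
    weights := []float64{0.9, 0.2, 0.3, 0.8, 0.7}

    fmt.Println("bagging (majority vote):", majorityVote(predictions))
    fmt.Println("boosting (weighted vote):", weightedVote(predictions, weights))
}

In a real boosting implementation such as AdaBoost, the model weights are derived from each learner's training error rather than chosen by hand, but the aggregation step follows the same weighted-vote pattern.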
Which of these ensemble methods is right for me?

Ensemble methods generally outperform a single model. This is why many Kaggle winners have utilized ensemble methodologies. Another important ensemble methodology, not discussed here, is stacking.

Boosting and bagging are both great techniques for decreasing variance, but they won't fix every problem, and they come with issues of their own. There are different reasons why you would use one over the other. Bagging is great for decreasing variance when a model is overfitted. However, of the two methods, boosting is often the better pick, because it is also great for decreasing bias in an underfit model. On the other hand, boosting is more likely to suffer from performance issues.

This is where experience and subject matter expertise come in! It may seem easy to jump on the first model that works, but it is important to analyze the algorithm and all the features it selects. For instance, a decision tree that sets specific leaves shouldn't be implemented if it can't be supported with other data points and visuals. It is not just about trying AdaBoost or random forests on various data sets; the final choice of algorithm should be driven by the results it produces and the support that can be provided for them.

[author title="About the Author"]Benjamin Rogojan

Ben has spent his career focused on healthcare data. He has focused on developing algorithms to detect fraud, reduce patient readmission, and redesign insurance provider policy to help reduce the overall cost of healthcare. He has also helped develop analytics for marketing and IT operations in order to optimize limited resources such as employees and budget. Ben privately consults on data science and engineering problems, both solo and with a company called Acheron Analytics. He has experience working hands-on with technical problems as well as helping leadership teams develop strategies to maximize their data.[/author]