Data | 0 articles | Tech News, Tutorials & Expert Insights

article-image-understanding-hadoop-backup-and-recovery-needs

10 Aug 2015

25 min read

Understanding Hadoop Backup and Recovery Needs

10 Aug 2015

0
0
7059

article-image-bayesian-network-fundamentals

Packt

10 Aug 2015

25 min read

Bayesian Network Fundamentals

Packt

10 Aug 2015

25 min read

In this article by Ankur Ankan and Abinash Panda, the authors of Mastering Probabilistic Graphical Models Using Python, we'll cover the basics of random variables, probability theory, and graph theory. We'll also see the Bayesian models and the independencies in Bayesian models. A graphical model is essentially a way of representing joint probability distribution over a set of random variables in a compact and intuitive form. There are two main types of graphical models, namely directed and undirected. We generally use a directed model, also known as a Bayesian network, when we mostly have a causal relationship between the random variables. Graphical models also give us tools to operate on these models to find conditional and marginal probabilities of variables, while keeping the computational complexity under control. (For more resources related to this topic, see here.) Probability theory To understand the concepts of probability theory, let's start with a real-life situation. Let's assume we want to go for an outing on a weekend. There are a lot of things to consider before going: the weather conditions, the traffic, and many other factors. If the weather is windy or cloudy, then it is probably not a good idea to go out. However, even if we have information about the weather, we cannot be completely sure whether to go or not; hence we have used the words probably or maybe. Similarly, if it is windy in the morning (or at the time we took our observations), we cannot be completely certain that it will be windy throughout the day. The same holds for cloudy weather; it might turn out to be a very pleasant day. Further, we are not completely certain of our observations. There are always some limitations in our ability to observe; sometimes, these observations could even be noisy. In short, uncertainty or randomness is the innate nature of the world. The probability theory provides us the necessary tools to study this uncertainty. It helps us look into options that are unlikely yet probable. Random variable Probability deals with the study of events. From our intuition, we can say that some events are more likely than others, but to quantify the likeliness of a particular event, we require the probability theory. It helps us predict the future by assessing how likely the outcomes are. Before going deeper into the probability theory, let's first get acquainted with the basic terminologies and definitions of the probability theory. A random variable is a way of representing an attribute of the outcome. Formally, a random variable X is a function that maps a possible set of outcomes ? to some set E, which is represented as follows: X : ? ? E As an example, let us consider the outing example again. To decide whether to go or not, we may consider the skycover (to check whether it is cloudy or not). Skycover is an attribute of the day. Mathematically, the random variable skycover (X) is interpreted as a function, which maps the day (?) to its skycover values (E). So when we say the event X = 40.1, it represents the set of all the days {?} such that , where is the mapping function. Formally speaking, . Random variables can either be discrete or continuous. A discrete random variable can only take a finite number of values. For example, the random variable representing the outcome of a coin toss can take only two values, heads or tails; and hence, it is discrete. Whereas, a continuous random variable can take infinite number of values. For example, a variable representing the speed of a car can take any number values. For any event whose outcome is represented by some random variable (X), we can assign some value to each of the possible outcomes of X, which represents how probable it is. This is known as the probability distribution of the random variable and is denoted by P(X). For example, consider a set of restaurants. Let X be a random variable representing the quality of food in a restaurant. It can take up a set of values, such as {good, bad, average}. P(X), represents the probability distribution of X, that is, if P(X = good) = 0.3, P(X = average) = 0.5, and P(X = bad) = 0.2. This means there is 30 percent chance of a restaurant serving good food, 50 percent chance of it serving average food, and 20 percent chance of it serving bad food. Independence and conditional independence In most of the situations, we are rather more interested in looking at multiple attributes at the same time. For example, to choose a restaurant, we won't only be looking just at the quality of food; we might also want to look at other attributes, such as the cost, location, size, and so on. We can have a probability distribution over a combination of these attributes as well. This type of distribution is known as joint probability distribution. Going back to our restaurant example, let the random variable for the quality of food be represented by Q, and the cost of food be represented by C. Q can have three categorical values, namely {good, average, bad}, and C can have the values {high, low}. So, the joint distribution for P(Q, C) would have probability values for all the combinations of states of Q and C. P(Q = good, C = high) will represent the probability of a pricey restaurant with good quality food, while P(Q = bad, C = low) will represent the probability of a restaurant that is less expensive with bad quality food. Let us consider another random variable representing an attribute of a restaurant, its location L. The cost of food in a restaurant is not only affected by the quality of food but also the location (generally, a restaurant located in a very good location would be more costly as compared to a restaurant present in a not-very-good location). From our intuition, we can say that the probability of a costly restaurant located at a very good location in a city would be different (generally, more) than simply the probability of a costly restaurant, or the probability of a cheap restaurant located at a prime location of city is different (generally less) than simply probability of a cheap restaurant. Formally speaking, P(C = high | L = good) will be different from P(C = high) and P(C = low | L = good) will be different from P(C = low). This indicates that the random variables C and L are not independent of each other. These attributes or random variables need not always be dependent on each other. For example, the quality of food doesn't depend upon the location of restaurant. So, P(Q = good | L = good) or P(Q = good | L = bad)would be the same as P(Q = good), that is, our estimate of the quality of food of the restaurant will not change even if we have knowledge of its location. Hence, these random variables are independent of each other. In general, random variables can be considered as independent of each other, if: They may also be considered independent if: We can easily derive this conclusion. We know the following from the chain rule of probability: P(X, Y) = P(X) P(Y | X) If Y is independent of X, that is, if X | Y, then P(Y | X) = P(Y). Then: P(X, Y) = P(X) P(Y) Extending this result on multiple variables, we can easily get to the conclusion that a set of random variables are independent of each other, if their joint probability distribution is equal to the product of probabilities of each individual random variable. Sometimes, the variables might not be independent of each other. To make this clearer, let's add another random variable, that is, the number of people visiting the restaurant N. Let's assume that, from our experience we know the number of people visiting only depends on the cost of food at the restaurant and its location (generally, lesser number of people visit costly restaurants). Does the quality of food Q affect the number of people visiting the restaurant? To answer this question, let's look into the random variable affecting N, cost C, and location L. As C is directly affected by Q, we can conclude that Q affects N. However, let's consider a situation when we know that the restaurant is costly, that is, C = high and let's ask the same question, "does the quality of food affect the number of people coming to the restaurant?". The answer is no. The number of people coming only depends on the price and location, so if we know that the cost is high, then we can easily conclude that fewer people will visit, irrespective of the quality of food. Hence, . This type of independence is called conditional independence. Installing tools Let's now see some coding examples using pgmpy, to represent joint distributions and independencies. Here, we will mostly work with IPython and pgmpy (and a few other libraries) for coding examples. So, before moving ahead, let's get a basic introduction to these. IPython IPython is a command shell for interactive computing in multiple programming languages, originally developed for the Python programming language, which offers enhanced introspection, rich media, additional shell syntax, tab completion, and a rich history. IPython provides the following features: Powerful interactive shells (terminal and Qt-based) A browser-based notebook with support for code, text, mathematical expressions, inline plots, and other rich media Support for interactive data visualization and use of GUI toolkits Flexible and embeddable interpreters to load into one's own projects Easy-to-use and high performance tools for parallel computing You can install IPython using the following command: >>> pip3 install ipython To start the IPython command shell, you can simply type ipython3 in the terminal. For more installation instructions, you can visit http://ipython.org/install.html. pgmpy pgmpy is a Python library to work with Probabilistic Graphical models. As it's currently not on PyPi, we will need to build it manually. You can get the source code from the Git repository using the following command: >>> git clone https://github.com/pgmpy/pgmpy Now cd into the cloned directory switch branch for version used and build it with the following code: >>> cd pgmpy >>> git checkout book/v0.1 >>> sudo python3 setup.py install For more installation instructions, you can visit http://pgmpy.org/install.html. With both IPython and pgmpy installed, you should now be able to run the examples. Representing independencies using pgmpy To represent independencies, pgmpy has two classes, namely IndependenceAssertion and Independencies. The IndependenceAssertion class is used to represent individual assertions of the form of or . Let's see some code to represent assertions: # Firstly we need to import IndependenceAssertion In [1]: from pgmpy.independencies import IndependenceAssertion # Each assertion is in the form of [X, Y, Z] meaning X is # independent of Y given Z. In [2]: assertion1 = IndependenceAssertion('X', 'Y') In [3]: assertion1 Out[3]: (X _|_ Y) Here, assertion1 represents that the variable X is independent of the variable Y. To represent conditional assertions, we just need to add a third argument to IndependenceAssertion: In [4]: assertion2 = IndependenceAssertion('X', 'Y', 'Z') In [5]: assertion2 Out [5]: (X _|_ Y | Z) In the preceding example, assertion2 represents . IndependenceAssertion also allows us to represent assertions in the form of . To do this, we just need to pass a list of random variables as arguments: In [4]: assertion2 = IndependenceAssertion('X', 'Y', 'Z') In [5]: assertion2 Out[5]: (X _|_ Y | Z) Moving on to the Independencies class, an Independencies object is used to represent a set of assertions. Often, in the case of Bayesian or Markov networks, we have more than one assertion corresponding to a given model, and to represent these independence assertions for the models, we generally use the Independencies object. Let's take a few examples: In [8]: from pgmpy.independencies import Independencies # There are multiple ways to create an Independencies object, we # could either initialize an empty object or initialize with some # assertions. In [9]: independencies = Independencies() # Empty object In [10]: independencies.get_assertions() Out[10]: [] In [11]: independencies.add_assertions(assertion1, assertion2) In [12]: independencies.get_assertions() Out[12]: [(X _|_ Y), (X _|_ Y | Z)] We can also directly initialize Independencies in these two ways: In [13]: independencies = Independencies(assertion1, assertion2) In [14]: independencies = Independencies(['X', 'Y'], ['A', 'B', 'C']) In [15]: independencies.get_assertions() Out[15]: [(X _|_ Y), (A _|_ B | C)] Representing joint probability distributions using pgmpy We can also represent joint probability distributions using pgmpy's JointProbabilityDistribution class. Let's say we want to represent the joint distribution over the outcomes of tossing two fair coins. So, in this case, the probability of all the possible outcomes would be 0.25, which is shown as follows: In [16]: from pgmpy.factors import JointProbabilityDistribution as Joint In [17]: distribution = Joint(['coin1', 'coin2'], [2, 2], [0.25, 0.25, 0.25, 0.25]) Here, the first argument includes names of random variable. The second argument is a list of the number of states of each random variable. The third argument is a list of probability values, assuming that the first variable changes its states the slowest. So, the preceding distribution represents the following: In [18]: print(distribution) +--------------------------------------+ ¦ coin1 ¦ coin2 ¦ P(coin1,coin2) ¦ ¦---------+---------+------------------¦ ¦ coin1_0 ¦ coin2_0 ¦ 0.2500 ¦ +---------+---------+------------------¦ ¦ coin1_0 ¦ coin2_1 ¦ 0.2500 ¦ +---------+---------+------------------¦ ¦ coin1_1 ¦ coin2_0 ¦ 0.2500 ¦ +---------+---------+------------------¦ ¦ coin1_1 ¦ coin2_1 ¦ 0.2500 ¦ +--------------------------------------+ We can also conduct independence queries over these distributions in pgmpy: In [19]: distribution.check_independence('coin1', 'coin2') Out[20]: True Conditional probability distribution Let's take an example to understand conditional probability better. Let's say we have a bag containing three apples and five oranges, and we want to randomly take out fruits from the bag one at a time without replacing them. Also, the random variables and represent the outcomes in the first try and second try respectively. So, as there are three apples and five oranges in the bag initially, and . Now, let's say that in our first attempt we got an orange. Now, we cannot simply represent the probability of getting an apple or orange in our second attempt. The probabilities in the second attempt will depend on the outcome of our first attempt and therefore, we use conditional probability to represent such cases. Now, in the second attempt, we will have the following probabilities that depend on the outcome of our first try: , , , and . The Conditional Probability Distribution (CPD) of two variables and can be represented as , representing the probability of given that is the probability of after the event has occurred and we know it's outcome. Similarly, we can have representing the probability of after having an observation for . The simplest representation of CPD is tabular CPD. In a tabular CPD, we construct a table containing all the possible combinations of different states of the random variables and the probabilities corresponding to these states. Let's consider the earlier restaurant example. Let's begin by representing the marginal distribution of the quality of food with Q. As we mentioned earlier, it can be categorized into three values {good, bad, average}. For example, P(Q) can be represented in the tabular form as follows: Quality P(Q) Good 0.3 Normal 0.5 Bad 0.2 Similarly, let's say P(L) is the probability distribution of the location of the restaurant. Its CPD can be represented as follows: Location P(L) Good 0.6 Bad 0.4 As the cost of restaurant C depends on both the quality of food Q and its location L, we will be considering P(C | Q, L), which is the conditional distribution of C, given Q and L: Location Good Bad Quality Good Normal Bad Good Normal Bad Cost High 0.8 0.6 0.1 0.6 0.6 0.05 Low 0.2 0.4 0.9 0.4 0.4 0.95 Representing CPDs using pgmpy Let's first see how to represent the tabular CPD using pgmpy for variables that have no conditional variables: In [1]: from pgmpy.factors import TabularCPD # For creating a TabularCPD object we need to pass three # arguments: the variable name, its cardinality that is the number # of states of the random variable and the probability value # corresponding each state. In [2]: quality = TabularCPD(variable='Quality', variable_card=3, values=[[0.3], [0.5], [0.2]]) In [3]: print(quality) +----------------------+ ¦ ['Quality', 0] ¦ 0.3 ¦ +----------------+-----¦ ¦ ['Quality', 1] ¦ 0.5 ¦ +----------------+-----¦ ¦ ['Quality', 2] ¦ 0.2 ¦ +----------------------+ In [4]: quality.variables Out[4]: OrderedDict([('Quality', [State(var='Quality', state=0), State(var='Quality', state=1), State(var='Quality', state=2)])]) In [5]: quality.cardinality Out[5]: array([3]) In [6]: quality.values Out[6]: array([0.3, 0.5, 0.2]) You can see here that the values of the CPD are a 1D array instead of a 2D array, which you passed as an argument. Actually, pgmpy internally stores the values of the TabularCPD as a flattened numpy array. In [7]: location = TabularCPD(variable='Location', variable_card=2, values=[[0.6], [0.4]]) In [8]: print(location) +-----------------------+ ¦ ['Location', 0] ¦ 0.6 ¦ +-----------------+-----¦ ¦ ['Location', 1] ¦ 0.4 ¦ +-----------------------+ However, when we have conditional variables, we also need to specify them and the cardinality of those variables. Let's define the TabularCPD for the cost variable: In [9]: cost = TabularCPD( variable='Cost', variable_card=2, values=[[0.8, 0.6, 0.1, 0.6, 0.6, 0.05], [0.2, 0.4, 0.9, 0.4, 0.4, 0.95]], evidence=['Q', 'L'], evidence_card=[3, 2]) Graph theory The second major framework for the study of probabilistic graphical models is graph theory. Graphs are the skeleton of PGMs, and are used to compactly encode the independence conditions of a probability distribution. Nodes and edges The foundation of graph theory was laid by Leonhard Euler when he solved the famous Seven Bridges of Konigsberg problem. The city of Konigsberg was set on both sides by the Pregel river and included two islands that were connected and maintained by seven bridges. The problem was to find a walk to exactly cross all the bridges once in a single walk. To visualize the problem, let's think of the graph in Fig 1.1: Fig 1.1: The Seven Bridges of Konigsberg graph Here, the nodes a, b, c, and d represent the land, and are known as vertices of the graph. The line segments ab, bc, cd, da, ab, and bc connecting the land parts are the bridges and are known as the edges of the graph. So, we can think of the problem of crossing all the bridges once in a single walk as tracing along all the edges of the graph without lifting our pencils. Formally, a graph G = (V, E) is an ordered pair of finite sets. The elements of the set V are known as the nodes or the vertices of the graph, and the elements of are the edges or the arcs of the graph. The number of nodes or cardinality of G, denoted by |V|, are known as the order of the graph. Similarly, the number of edges denoted by |E| are known as the size of the graph. Here, we can see that the Konigsberg city graph shown in Fig 1.1 is of order 4 and size 7. In a graph, we say that two vertices, u, v ? V are adjacent if u, v ? E. In the City graph, all the four vertices are adjacent to each other because there is an edge for every possible combination of two vertices in the graph. Also, for a vertex v ? V, we define the neighbors set of v as . In the City graph, we can see that b and d are neighbors of c. Similarly, a, b, and c are neighbors of d. We define an edge to be a self loop if the start vertex and the end vertex of the edge are the same. We can put it more formally as, any edge of the form (u, u), where u ? V is a self loop. Until now, we have been talking only about graphs whose edges don't have a direction associated with them, which means that the edge (u, v) is same as the edge (v, u). These types of graphs are known as undirected graphs. Similarly, we can think of a graph whose edges have a sense of direction associated with it. For these graphs, the edge set E would be a set of ordered pair of vertices. These types of graphs are known as directed graphs. In the case of a directed graph, we also define the indegree and outdegree for a vertex. For a vertex v ? V, we define its outdegree as the number of edges originating from the vertex v, that is, . Similarly, the indegree is defined as the number of edges that end at the vertex v, that is, . Walk, paths, and trails For a graph G = (V, E) and u,v ? V, we define a u - v walk as an alternating sequence of vertices and edges, starting with u and ending with v. In the City graph of Fig 1.1, we can have an example of a - d walk as . If there aren't multiple edges between the same vertices, then we simply represent a walk by a sequence of vertices. As in the case of the Butterfly graph shown in Fig 1.2, we can have a walk W : a, c, d, c, e: Fig 1.2: Butterfly graph—a undirected graph A walk with no repeated edges is known as a trail. For example, the walk in the City graph is a trail. Also, a walk with no repeated vertices, except possibly the first and the last, is known as a path. For example, the walk in the City graph is a path. Also, a graph is known as cyclic if there are one or more paths that start and end at the same node. Such paths are known as cycles. Similarly, if there are no cycles in a graph, it is known as an acyclic graph. Bayesian models In most of the real-life cases when we would be representing or modeling some event, we would be dealing with a lot of random variables. Even if we would consider all the random variables to be discrete, there would still be exponentially large number of values in the joint probability distribution. Dealing with such huge amount of data would be computationally expensive (and in some cases, even intractable), and would also require huge amount of memory to store the probability of each combination of states of these random variables. However, in most of the cases, many of these variables are marginally or conditionally independent of each other. By exploiting these independencies, we can reduce the number of values we need to store to represent the joint probability distribution. For instance, in the previous restaurant example, the joint probability distribution across the four random variables that we discussed (that is, quality of food Q, location of restaurant L, cost of food C, and the number of people visiting N) would require us to store 23 independent values. By the chain rule of probability, we know the following: P(Q, L, C, N) = P(Q) P(L|Q) P(C|L, Q) P(N|C, Q, L) Now, let us try to exploit the marginal and conditional independence between the variables, to make the representation more compact. Let's start by considering the independency between the location of the restaurant and quality of food over there. As both of these attributes are independent of each other, P(L|Q) would be the same as P(L). Therefore, we need to store only one parameter to represent it. From the conditional independence that we have seen earlier, we know that . Thus, P(N|C, Q, L) would be the same as P(N|C, L); thus needing only four parameters. Therefore, we now need only (2 + 1 + 6 + 4 = 13) parameters to represent the whole distribution. We can conclude that exploiting independencies helps in the compact representation of joint probability distribution. This forms the basis for the Bayesian network. Representation A Bayesian network is represented by a Directed Acyclic Graph (DAG) and a set of Conditional Probability Distributions (CPD) in which: The nodes represent random variables The edges represent dependencies For each of the nodes, we have a CPD In our previous restaurant example, the nodes would be as follows: Quality of food (Q) Location (L) Cost of food (C) Number of people (N) As the cost of food was dependent on the quality of food (Q) and the location of the restaurant (L), there will be an edge each from Q ? C and L ? C. Similarly, as the number of people visiting the restaurant depends on the price of food and its location, there would be an edge each from L ? N and C ? N. The resulting structure of our Bayesian network is shown in Fig 1.3: Fig 1.3: Bayesian network for the restaurant example Factorization of a distribution over a network Each node in our Bayesian network for restaurants has a CPD associated to it. For example, the CPD for the cost of food in the restaurant is P(C|Q, L), as it only depends on the quality of food and location. For the number of people, it would be P(N|C, L) . So, we can generalize that the CPD associated with each node would be P(node|Par(node)) where Par(node) denotes the parents of the node in the graph. Assuming some probability values, we will finally get a network as shown in Fig 1.4: Fig 1.4: Bayesian network of restaurant along with CPDs Let us go back to the joint probability distribution of all these attributes of the restaurant again. Considering the independencies among variables, we concluded as follows: P(Q,C,L,N) = P(Q)P(L)P(C|Q, L)P(N|C, L) So now, looking into the Bayesian network (BN) for the restaurant, we can say that for any Bayesian network, the joint probability distribution over all its random variables {X1,X2,...,Xn} can be represented as follows: This is known as the chain rule for Bayesian networks. Also, we say that a distribution P factorizes over a graph G, if P can be encoded as follows: Here, ParG(X) is the parent of X in the graph G. Summary In this article, we saw how we can represent a complex joint probability distribution using a directed graph and a conditional probability distribution associated with each node, which is collectively known as a Bayesian network. Resources for Article: Further resources on this subject: Web Scraping with Python [article] Exact Inference Using Graphical Models [article] wxPython: Design Approaches and Techniques [article]

0
0
47571

Packt

07 Aug 2015

9 min read

NLTK for hackers

Packt

07 Aug 2015

9 min read

In this article written by Nitin Hardeniya, author of the book NLTK Essentials, we will learn that "Life is short, we need Python" that's the mantra I follow and truly believe in. As fresh graduates, we learned and worked mostly with C/C++/JAVA. While these languages have amazing features, Python has a charm of its own. The day I started using Python I loved it. I really did. The big coincidence here is that I finally ended up working with Python during my initial projects on the job. I started to love the kind of datastructures, Libraries, and echo system Python has for beginners as well as for an expert programmer. (For more resources related to this topic, see here.) Python as a language has advanced very fast and spatially. If you are a Machine learning/ Natural language Processing enthusiast, then Python is 'the' go-to language these days. Python has some amazing ways of dealing with strings. It has a very easy and elegant coding style, and most importantly a long list of open libraries. I can go on and on about Python and my love for it. But here I want to talk about very specifically about NLTK (Natural Language Toolkit), one of the most popular Python libraries for Natural language processing. NLTK is simply awesome, and in my opinion,it's the best way to learn and implement some of the most complex NLP concepts. NLTK has variety of generic text preprocessing tool, such as Tokenization, Stop word removal, Stemming, and at the same time,has some very NLP-specific tools,such as Part of speech tagging, Chunking, Named Entity recognition, and dependency parsing. NLTK provides some of the easiest solutions to all the above stages of NLP and that's why it is the most preferred library for any text processing/ text mining application. NLTK not only provides some pretrained models that can be applied directly to your dataset, it also provides ways to customize and build your own taggers, tokenizers, and so on. NLTK is a big library that has many tools available for an NLP developer. I have provided a cheat-sheet of some of the most common steps and their solutions using NLTK. In our book, NLTK Essentials, I have tried to give you enough information to deal with all these processing steps using NLTK. To show you the power of NLTK, let's try to develop a very easy application of finding topics in the unstructured text in a word cloud. Word CloudNLTK Instead of going further into the theoretical aspects of natural language processing, let's start with a quick dive into NLTK. I am going to start with some basic example use cases of NLTK. There is a good chance that you have already done something similar. First, I will give a typical Python programmer approach and then move on to NLTK for a much more efficient, robust, and clean solution. We will start analyzing with some example text content: >>>import urllib2>>># urllib2 is use to download the html content of the web link>>>response = urllib2.urlopen('http://python.org/')>>># You can read the entire content of a file using read() method>>>html = response.read()>>>print len(html)47020 For the current example, I have taken the content from Python's home page: https://www.python.org/. We don't have any clue about the kind of topics that are discussed in this URL, so let's say that we want to start an exploratory data analysis (EDA). Typically in a text domain, EDA can have many meanings, but will go with a simple case of what kinds of terms dominate the documents. What are the topics? How frequent are they? The process will involve some level of preprocessing we will try to do this in a pure Python wayand then we will do it using NLTK. Let's start with cleaning the html tags. One way to do this is to select just tokens, including numbers and character. Anybody who has worked with regular expression should be able to convert html string into a list of tokens: >>># regular expression based split the string>>>tokens = [tok for tok in html.split()]>>>print "Total no of tokens :"+ str(len(tokens))>>># first 100 tokens>>>print tokens[0:100]Total no of tokens :2860['<!doctype', 'html>', '', '', ''type="text/css"', 'media="not', 'print,', 'braille,' ...] As you can see, there is an excess of html tags and other unwanted characters when we use the preceding method. A cleaner version of the same task will look something like this: >>>import re>>># using the split function https://docs.python.org/2/library/re.html>>>tokens = re.split('W+',html)>>>print len(tokens)>>>print tokens[0:100]5787['', 'doctype', 'html', 'if', 'lt', 'IE', '7', 'html', 'class', 'no', 'js', 'ie6', 'lt', 'ie7', 'lt', 'ie8', 'lt', 'ie9', 'endif', 'if', 'IE', '7', 'html', 'class', 'no', 'js', 'ie7', 'lt', 'ie8', 'lt', 'ie9', 'endif', 'if', 'IE', '8', 'msapplication', 'tooltip', 'content', 'The', 'official', 'home', 'of', 'the', 'Python', 'Programming', 'Language', 'meta', 'name', 'apple' ...] This looks much cleaner now. But still you can do more; I leave it to you to try to remove as much noise as you can. You can still look for word length as a criteria and remove words that have a length one—it will remove elements,such as 7, 8, and so on, which are just noise in this case. Now let's go to NLTK for the same task. There is a function called clean_html() that can do all the work we were looking for: >>>import nltk>>># http://www.nltk.org/api/nltk.html#nltk.util.clean_html>>>clean = nltk.clean_html(html)>>># clean will have entire string removing all the html noise>>>tokens = [tok for tok in clean.split()]>>>print tokens[:100]['Welcome', 'to', 'Python.org', 'Skip', 'to', 'content', '▼', 'Close', 'Python', 'PSF', 'Docs', 'PyPI', 'Jobs', 'Community', '▲', 'The', 'Python', 'Network', '&equiv;', 'Menu', 'Arts', 'Business' ...] Cool, right? This definitely is much cleaner and easier to do. No analysis in any EDA can start without distribution. Let's try to get the frequency distribution. First, let's do it the Python way, then I will tell you the NLTK recipe. >>>import operator>>>freq_dis={}>>>for tok in tokens:>>> if tok in freq_dis:>>> freq_dis[tok]+=1>>> else:>>> freq_dis[tok]=1>>># We want to sort this dictionary on values ( freq in this case )>>>sorted_freq_dist= sorted(freq_dis.items(), key=operator.itemgetter(1), reverse=True)>>> print sorted_freq_dist[:25][('Python', 55), ('>>>', 23), ('and', 21), ('to', 18), (',', 18), ('the', 14), ('of', 13), ('for', 12), ('a', 11), ('Events', 11), ('News', 11), ('is', 10), ('2014-', 10), ('More', 9), ('#', 9), ('3', 9), ('=', 8), ('in', 8), ('with', 8), ('Community', 7), ('The', 7), ('Docs', 6), ('Software', 6), (':', 6), ('3:', 5), ('that', 5), ('sum', 5)] Naturally, as this is Python's home page, Python and the >>> interpreters are the most common terms, also giving a sense about the website. A better and efficient approach is to use NLTK's FreqDist() function. For this, we will take a look at the same code we developed before: >>>import nltk>>>Freq_dist_nltk=nltk.FreqDist(tokens)>>>print Freq_dist_nltk>>>for k,v in Freq_dist_nltk.items():>>> print str(k)+':'+str(v)<FreqDist: 'Python': 55, '>>>': 23, 'and': 21, ',': 18, 'to': 18, 'the': 14, 'of': 13, 'for': 12, 'Events': 11, 'News': 11, ...>Python:55>>>:23and:21,:18to:18the:14of:13for:12Events:11News:11 Let's now do some more funky things. Let's plot this: >>>Freq_dist_nltk.plot(50, cumulative=False)>>># below is the plot for the frequency distributions We can see that the cumulative frequency is growing, and at words such as other and frequency 400, the curve is going into long tail. Still, there is some noise, and there are words such asthe, of, for, and =. These are useless words, and there is a terminology for these words. These words are stop words,such asthe, a, and an. Article pronouns are generally present in most of the documents; hence, they are not discriminative enough to be informative. In most of the NLP and information retrieval tasks, people generally remove stop words. Let's go back again to our running example: >>>stopwords=[word.strip().lower() for word in open("PATH/english.stop.txt")]>>>clean_tokens=[tok for tok in tokens if len(tok.lower())>1 and (tok.lower() not in stopwords)]>>>Freq_dist_nltk=nltk.FreqDist(clean_tokens)>>>Freq_dist_nltk.plot(50, cumulative=False) This looks much cleaner now! After finishing this much, you should be able to get something like this using word cloud: Please go to http://www.wordle.net/advanced for more word clouds. Summary To summarize, this article was intended to give you a brief introduction toNatural Language Processing. The book does assume some background in NLP andprogramming in Python, but we have tried to give a very quick head start to Pythonand NLP. Resources for Article: Further resources on this subject: Hadoop Monitoring and its aspects [Article] Big Data Analysis (R and Hadoop) [Article] SciPy for Signal Processing [Article]

0
0
2823

Packt

23 Jul 2015

18 min read

Elasticsearch – Spicing Up a Search Using Geo

Packt

23 Jul 2015

18 min read

A geo point refers to the latitude and longitude of a point on Earth. Each location on it has its own unique latitude and longitude. Elasticsearch is aware of geo-based points and allows you to perform various operations on top of it. In many contexts, it's also required to consider a geo location component to obtain various functionalities. For example, say you need to search for all the nearby restaurants that serve Chinese food or I need to find the nearest cab that is free. In some other situation, I need to find to which state a particular geo point location belongs to understand where I am currently standing. This article by Vineeth Mohan, author of the book Elasticsearch Blueprints, is modeled such that all the examples mentioned are related to real-life scenarios, of restaurant searching, for better understanding. Here, we take the example of sorting restaurants based on geographical preferences. A number of cases ranging from the simple, such as finding the nearest restaurant, to the more complex case, such as categorization of restaurants based on distance are covered in this article. What makes Elasticsearch unique and powerful is the fact that you can combine geo operation with any other normal search query to yield results clubbed with both the location data and the query data. (For more resources related to this topic, see here.) Restaurant search Let's consider creating a search portal for restaurants. The following are its requirements: To find the nearest restaurant with Chinese cuisine, which has the word ChingYang in its name. To decrease the importance of all restaurants outside city limits. To find the distance between the restaurant and current point for each of the preceding restaurant matches. To find whether the person is in a particular city's limit or not. To aggregate all restaurants within a distance of 10 km. That is, for a radius of the first 10 km, we have to compute the number of restaurants. For the next 10 km, we need to compute the number of restaurants and so on. Data modeling for restaurants Firstly, we need to see the aspects of data and model it around a JSON document for Elasticsearch to make sense of the data. A restaurant has a name, its location information, and rating. To store the location information, Elasticsearch has a provision to understand the latitude and longitude information and has features to conduct searches based on it. Hence, it would be best to use this feature. Let's see how we can do this. First, let's see what our document should look like: { "name" : "Tamarind restaurant", "location" : { "lat" : 1.10, "lon" : 1.54 } } Now, let's define the schema for the same: curl -X PUT "http://$hostname:9200/restaurants" -d '{ "index": { "number_of_shards": 1, "number_of_replicas": 1 }, "analysis":{ "analyzer":{ "flat" : { "type" : "custom", "tokenizer" : "keyword", "filter" : "lowercase" } } } }' echo curl -X PUT "http://$hostname:9200/restaurants /restaurant/_mapping" -d '{ "restaurant" : { "properties" : { "name" : { "type" : "string" }, "location" : { "type" : "geo_point", "accuracy" : "1km" } }} }' Let's now index some documents in the index. An example of this would be the Tamarind restaurant data shown in the previous section. We can index the data as follows: curl -XPOST 'http://localhost:9200/restaurants/restaurant' -d '{ "name": "Tamarind restaurant", "location": { "lat": 1.1, "lon": 1.54 } }' Likewise, we can index any number of documents. For the sake of convenience, we have indexed only a total of five restaurants for this article. The latitude and longitude should be of this format. Elasticsearch also accepts two other formats (geohash and lat_lon), but let's stick to this one. As we have mapped the field location to the type geo_point, Elasticsearch is aware of what this information means and how to act upon it. The nearest hotel problem Let's assume that we are at a particular point where the latitude is 1.234 and the longitude is 2.132. We need to find the nearest restaurants to this location. For this purpose, the function_score query is the best option. We can use the decay (Gauss) functionality of the function score query to achieve this: curl -XPOST 'http://localhost:9200/restaurants/_search' -d '{ "query": { "function_score": { "functions": [ { "gauss": { "location": { "scale": "1km", "origin": [ 1.231, 1.012 ] } } } ] } } }' Here, we tell Elasticsearch to give a higher score to the restaurants that are nearby the referral point we gave it. The closer it is, the higher is the importance. Maximum distance covered Now, let's move on to another example of finding restaurants that are within 10 kms from my current position. Those that are beyond 10 kms are of no interest to me. So, it almost makes up to a circle with a radius of 10 km from my current position, as shown in the following map: Our best bet here is using a geo distance filter. It can be used as follows: curl -XPOST 'http://localhost:9200/restaurants/_search' -d '{ "query": { "filtered": { "filter": { "geo_distance": { "distance": "100km", "location": { "lat": 1.232, "lon": 1.112 } } } } } }' Inside city limits Next, I need to consider only those restaurants that are inside a particular city limit; the rest are of no interest to me. As the city shown in the following map is rectangle in nature, this makes my job easier: Now, to see whether a geo point is inside a rectangle, we can use the bounding box filter. A rectangle is marked when you feed the top-left point and bottom-right point. Let's assume that the city is within the following rectangle with the top-left point as X and Y and the bottom-right point as A and B: curl -XPOST 'http://localhost:9200/restaurants/_search' -d '{ "query": { "filtered": { "query": { "match_all": {} }, "filter": { "geo_bounding_box": { "location": { "top_left": { "lat": 2, "lon": 0 }, "bottom_right": { "lat": 0, "lon": 2 } } } } } } }' Distance values between the current point and each restaurant Now, consider the scenario where you need to find the distance between the user location and each restaurant. How can we achieve this requirement? We can use scripts; the current geo coordinates are passed to the script and then the query to find the distance between each restaurant is run, as in the following code. Here, the current location is given as (1, 2): curl -XPOST 'http://localhost:9200/restaurants/_search?pretty' -d '{ "script_fields": { "distance": { "script": "doc['"'"'location'"'"'].arcDistanceInKm(1, 2)" } }, "fields": [ "name" ], "query": { "match": { "name": "chinese" } } }' We have used the function called arcDistanceInKm in the preceding query, which accepts the geo coordinates and then returns the distance between that point and the locations satisfied by the query. Note that the unit of distance calculated is in kilometers (km). You might have noticed a long list of quotes and double quotes before and after location in the script mentioned previously. This is the standard format and if we don't use this, it would result in returning the format error while processing. The distances are calculated from the current point to the filtered hotels and are returned in the distance field of response, as shown in the following code: { "took" : 3, "timed_out" : false, "_shards" : { "total" : 1, "successful" : 1, "failed" : 0 }, "hits" : { "total" : 2, "max_score" : 0.7554128, "hits" : [ { "_index" : "restaurants", "_type" : "restaurant", "_id" : "AU08uZX6QQuJvMORdWRK", "_score" : 0.7554128, "fields" : { "distance" : [ 112.92927483176413 ], "name" : [ "Great chinese restaurant" ] } }, { "_index" : "restaurants", "_type" : "restaurant", "_id" : "AU08uZaZQQuJvMORdWRM", "_score" : 0.7554128, "fields" : { "distance" : [ 137.61635969665923 ], "name" : [ "Great chinese restaurant" ] } } ] } } Note that the distances measured from the current point to the hotels are direct distances and not road distances. Restaurant out of city limits One of my friends called me and asked me to join him on his journey to the next city. As we were leaving the city, he was particular that he wants to eat at some restaurant off the city limits, but outside the next city. For this, the requirement was translated to any restaurant that is minimum 15 kms and a maximum of 100 kms from the center of the city. Hence, we have something like a donut in which we have to conduct our search, as show in the following map: The area inside the donut is a match, but the area outside is not. For this donut area calculation, we have the geo_distance_range filter to our rescue. Here, we can apply the minimum distance and maximum distance in the fields from and to to populate the results, as shown in the following code: curl -XPOST 'http://localhost:9200/restaurants/_search' -d '{ "query": { "filtered": { "query": { "match_all": {} }, "filter": { "geo_distance_range": { "from": "15km", "to": "100km", "location": { "lat": 1.232, "lon": 1.112 } } } } } }' Restaurant categorization based on distance In an e-commerce solution, to search restaurants, it's required that you increase the searchable characteristics of the application. This means that if we are able to give a snapshot of results other than the top-10 results, it would add to the searchable characteristics of the search. For example, if we are able to show how many restaurants serve Indian, Thai, or other cuisines, it would actually help the user to get a better idea of the result set. In a similar manner, if we can tell them if the restaurant is near, at a medium distance, or far away, we can really pull a chord in the restaurant search user experience, as shown in the following map: Implementing this is not hard, as we have something called the distance range aggregation. In this aggregation type, we can handcraft the range of distance we are interested in and create a bucket for each of them. We can also define the key name we need, as shown in the following code: curl -XPOST 'http://localhost:9200/restaurants/_search' -d '{ "aggs": { "distanceRanges": { "geo_distance": { "field": "location", "origin": "1.231, 1.012", "unit": "meters", "ranges": [ { "key": "Near by Locations", "to": 200 }, { "key": "Medium distance Locations", "from": 200, "to": 2000 }, { "key": "Far Away Locations", "from": 2000 } ] } } } }' In the preceding code, we categorized the restaurants under three distance ranges, which are the nearby hotels (less than 200 meters), medium distant hotels (within 200 meters to 2,000 meters), and the far away ones (greater than 2,000 meters). This logic was translated to the Elasticsearch query using which, we received the results as follows: { "took": 44, "timed_out": false, "_shards": { "total": 1, "successful": 1, "failed": 0 }, "hits": { "total": 5, "max_score": 0, "hits": [ ] }, "aggregations": { "distanceRanges": { "buckets": [ { "key": "Near by Locations", "from": 0, "to": 200, "doc_count": 1 }, { "key": "Medium distance Locations", "from": 200, "to": 2000, "doc_count": 0 }, { "key": "Far Away Locations", "from": 2000, "doc_count": 4 } ] } } } In the results, we received how many restaurants are there in each distance range indicated by the doc_count field. Aggregating restaurants based on their nearness In the previous example, we saw the aggregation of restaurants based on their distance from the current point to three different categories. Now, we can consider another scenario in which we classify the restaurants on the basis of the geohash grids that they belong to. This kind of classification can be advantageous if the user would like to get a geographical picture of how the restaurants are distributed. Here is the code for a geohash-based aggregation of restaurants: curl -XPOST 'http://localhost:9200/restaurants/_search?pretty' -d '{ "size": 0, "aggs": { "DifferentGrids": { "geohash_grid": { "field": "location", "precision": 6 }, "aggs": { "restaurants": { "top_hits": {} } } } } }' You can see from the preceding code that we used the geohash aggregation, which is named as DifferentGrids and the precision here, is to be set as 6. The precision field value can be varied within the range of 1 to 12, with 1 being the lowest and 12 being the highest reference of precision. Also, we used another aggregation named restaurants inside the DifferentGrids aggregation. The restaurant aggregation uses the top_hits query to fetch the aggregated details from the DifferentGrids aggregation, which otherwise, would return only the key and doc_count values. So, running the preceding code gives us the following result: { "took":5, "timed_out":false, "_shards":{ "total":1, "successful":1, "failed":0 }, "hits":{ "total":5, "max_score":0, "hits":[ ] }, "aggregations":{ "DifferentGrids":{ "buckets":[ { "key":"s009", "doc_count":2, "restaurants":{... } }, { "key":"s01n", "doc_count":1, "restaurants":{... } }, { "key":"s00x", "doc_count":1, "restaurants":{... } }, { "key":"s00p", "doc_count":1, "restaurants":{... } } ] } } } As we can see from the response, there are four buckets with the key values, which are s009, s01n, s00x, and s00p. These key values represent the different geohash grids that the restaurants belong to. From the preceding result, we can evidently say that the s009 grid contains two restaurants inside it and all the other grids contain one each. A pictorial representation of the previous aggregation would be like the one shown on the following map: Summary We found that Elasticsearch can handle geo point and various geo-specific operations. A few geospecific and geopoint operations that we covered in this article were searching for nearby restaurants (restaurants inside a circle), searching for restaurants within a range (restaurants inside a concentric circle), searching for restaurants inside a city (restaurants inside a rectangle), searching for restaurants inside a polygon, and categorization of restaurants by the proximity. Apart from these, we can use Kibana, a flexible and powerful visualization tool provided by Elasticsearch for geo-based operations. Resources for Article: Further resources on this subject: Elasticsearch Administration [article] Extending ElasticSearch with Scripting [article] Indexing the Data [article]

0
0
3814

article-image-getting-started-apache-spark

Packt

17 Jul 2015

7 min read

Getting Started with Apache Spark

Packt

17 Jul 2015

7 min read

In this article by Rishi Yadav, the author of Spark Cookbook, we will cover the following recipes: Installing Spark from binaries Building the Spark source code with Maven (For more resources related to this topic, see here.) Introduction Apache Spark is a general-purpose cluster computing system to process big data workloads. What sets Spark apart from its predecessors, such as MapReduce, is its speed, ease-of-use, and sophisticated analytics. Apache Spark was originally developed at AMPLab, UC Berkeley, in 2009. It was made open source in 2010 under the BSD license and switched to the Apache 2.0 license in 2013. Toward the later part of 2013, the creators of Spark founded Databricks to focus on Spark's development and future releases. Talking about speed, Spark can achieve sub-second latency on big data workloads. To achieve such low latency, Spark makes use of the memory for storage. In MapReduce, memory is primarily used for actual computation. Spark uses memory both to compute and store objects. Spark also provides a unified runtime connecting to various big data storage sources, such as HDFS, Cassandra, HBase, and S3. It also provides a rich set of higher-level libraries for different big data compute tasks, such as machine learning, SQL processing, graph processing, and real-time streaming. These libraries make development faster and can be combined in an arbitrary fashion. Though Spark is written in Scala, and this book only focuses on recipes in Scala, Spark also supports Java and Python. Spark is an open source community project, and everyone uses the pure open source Apache distributions for deployments, unlike Hadoop, which has multiple distributions available with vendor enhancements. The following figure shows the Spark ecosystem: The Spark runtime runs on top of a variety of cluster managers, including YARN (Hadoop's compute framework), Mesos, and Spark's own cluster manager called standalone mode. Tachyon is a memory-centric distributed file system that enables reliable file sharing at memory speed across cluster frameworks. In short, it is an off-heap storage layer in memory, which helps share data across jobs and users. Mesos is a cluster manager, which is evolving into a data center operating system. YARN is Hadoop's compute framework that has a robust resource management feature that Spark can seamlessly use. Installing Spark from binaries Spark can be either built from the source code or precompiled binaries can be downloaded from http://spark.apache.org. For a standard use case, binaries are good enough, and this recipe will focus on installing Spark using binaries. Getting ready All the recipes in this book are developed using Ubuntu Linux but should work fine on any POSIX environment. Spark expects Java to be installed and the JAVA_HOME environment variable to be set. In Linux/Unix systems, there are certain standards for the location of files and directories, which we are going to follow in this book. The following is a quick cheat sheet: Directory Description /bin Essential command binaries /etc Host-specific system configuration /opt Add-on application software packages /var Variable data /tmp Temporary files /home User home directories How to do it... At the time of writing this, Spark's current version is 1.4. Please check the latest version from Spark's download page at http://spark.apache.org/downloads.html. Binaries are developed with a most recent and stable version of Hadoop. To use a specific version of Hadoop, the recommended approach is to build from sources, which will be covered in the next recipe. The following are the installation steps: Open the terminal and download binaries using the following command: $ wget http://d3kbcqa49mib13.cloudfront.net/spark-1.4.0-bin-hadoop2.4.tgz Unpack binaries: $ tar -zxf spark-1.4.0-bin-hadoop2.4.tgz Rename the folder containing binaries by stripping the version information: $ sudo mv spark-1.4.0-bin-hadoop2.4 spark Move the configuration folder to the /etc folder so that it can be made a symbolic link later: $ sudo mv spark/conf/* /etc/spark Create your company-specific installation directory under /opt. As the recipes in this book are tested on infoobjects sandbox, we are going to use infoobjects as directory name. Create the /opt/infoobjects directory: $ sudo mkdir -p /opt/infoobjects Move the spark directory to /opt/infoobjects as it's an add-on software package: $ sudo mv spark /opt/infoobjects/ Change the ownership of the spark home directory to root: $ sudo chown -R root:root /opt/infoobjects/spark Change permissions of the spark home directory, 0755 = user:read-write-execute group:read-execute world:read-execute: $ sudo chmod -R 755 /opt/infoobjects/spark Move to the spark home directory: $ cd /opt/infoobjects/spark Create the symbolic link: $ sudo ln -s /etc/spark conf Append to PATH in .bashrc: $ echo "export PATH=$PATH:/opt/infoobjects/spark/bin" >> /home/hduser/.bashrc Open a new terminal. Create the log directory in /var: $ sudo mkdir -p /var/log/spark Make hduser the owner of the Spark log directory. $ sudo chown -R hduser:hduser /var/log/spark Create the Spark tmp directory: $ mkdir /tmp/spark Configure Spark with the help of the following command lines: $ cd /etc/spark$ echo "export HADOOP_CONF_DIR=/opt/infoobjects/hadoop/etc/hadoop">> spark-env.sh$ echo "export YARN_CONF_DIR=/opt/infoobjects/hadoop/etc/Hadoop">> spark-env.sh$ echo "export SPARK_LOG_DIR=/var/log/spark" >> spark-env.sh$ echo "export SPARK_WORKER_DIR=/tmp/spark" >> spark-env.sh Building the Spark source code with Maven Installing Spark using binaries works fine in most cases. For advanced cases, such as the following (but not limited to), compiling from the source code is a better option: Compiling for a specific Hadoop version Adding the Hive integration Adding the YARN integration Getting ready The following are the prerequisites for this recipe to work: Java 1.6 or a later version Maven 3.x How to do it... The following are the steps to build the Spark source code with Maven: Increase MaxPermSize for heap: $ echo "export _JAVA_OPTIONS="-XX:MaxPermSize=1G"" >> /home/hduser/.bashrc Open a new terminal window and download the Spark source code from GitHub: $ wget https://github.com/apache/spark/archive/branch-1.4.zip Unpack the archive: $ gunzip branch-1.4.zip Move to the spark directory: $ cd spark Compile the sources with these flags: Yarn enabled, Hadoop version 2.4, Hive enabled, and skipping tests for faster compilation: $ mvn -Pyarn -Phadoop-2.4 -Dhadoop.version=2.4.0 -Phive -DskipTests clean package Move the conf folder to the etc folder so that it can be made a symbolic link: $ sudo mv spark/conf /etc/ Move the spark directory to /opt as it's an add-on software package: $ sudo mv spark /opt/infoobjects/spark Change the ownership of the spark home directory to root: $ sudo chown -R root:root /opt/infoobjects/spark Change the permissions of the spark home directory 0755 = user:rwx group:r-x world:r-x: $ sudo chmod -R 755 /opt/infoobjects/spark Move to the spark home directory: $ cd /opt/infoobjects/spark Create a symbolic link: $ sudo ln -s /etc/spark conf Put the Spark executable in the path by editing .bashrc: $ echo "export PATH=$PATH:/opt/infoobjects/spark/bin" >> /home/hduser/.bashrc Create the log directory in /var: $ sudo mkdir -p /var/log/spark Make hduser the owner of the Spark log directory: $ sudo chown -R hduser:hduser /var/log/spark Create the Spark tmp directory: $ mkdir /tmp/spark Configure Spark with the help of the following command lines: $ cd /etc/spark$ echo "export HADOOP_CONF_DIR=/opt/infoobjects/hadoop/etc/hadoop">> spark-env.sh$ echo "export YARN_CONF_DIR=/opt/infoobjects/hadoop/etc/Hadoop">> spark-env.sh$ echo "export SPARK_LOG_DIR=/var/log/spark" >> spark-env.sh$ echo "export SPARK_WORKER_DIR=/tmp/spark" >> spark-env.sh Summary In this article, we learned what Apache Spark is, how we can install Spark from binaries, and how to build Spark source code with Maven. Resources for Article: Further resources on this subject: Big Data Analysis (R and Hadoop) [Article] YARN and Hadoop [Article] Hadoop and SQL [Article]

0
0
2159

article-image-clustering-and-other-unsupervised-learning-methods

Packt

09 Jul 2015

19 min read

Clustering and Other Unsupervised Learning Methods

Packt

09 Jul 2015

19 min read

0
0
32729

article-image-working-large-data-sources

Packt

08 Jul 2015

20 min read

Working with large data sources

Packt

08 Jul 2015

20 min read

In this article, by Duncan M. McGreggor, author of the book Mastering matplotlib, we come across the use of NumPy in the world of matplotlib and big data, problems with large data sources, and the possible solutions to these problems. (For more resources related to this topic, see here.) Most of the data that users feed into matplotlib when generating plots is from NumPy. NumPy is one of the fastest ways of processing numerical and array-based data in Python (if not the fastest), so this makes sense. However by default, NumPy works on in-memory database. If the dataset that you want to plot is larger than the total RAM available on your system, performance is going to plummet. In the following section, we're going to take a look at an example that illustrates this limitation. But first, let's get our notebook set up, as follows: In [1]: import matplotlib matplotlib.use('nbagg') %matplotlib inline Here are the modules that we are going to use: In [2]: import glob, io, math, os import psutil import numpy as np import pandas as pd import tables as tb from scipy import interpolate from scipy.stats import burr, norm import matplotlib as mpl import matplotlib.pyplot as plt from IPython.display import Image We'll use the custom style sheet that we created earlier, as follows: In [3]: plt.style.use("../styles/superheroine-2.mplstyle") An example problem To keep things manageable for an in-memory example, we're going to limit our generated dataset to 100 million points by using one of SciPy's many statistical distributions, as follows: In [4]: (c, d) = (10.8, 4.2) (mean, var, skew, kurt) = burr.stats(c, d, moments='mvsk') The Burr distribution, also known as the Singh–Maddala distribution, is commonly used to model household income. Next, we'll use the burr object's method to generate a random population with our desired count, as follows: In [5]: r = burr.rvs(c, d, size=100000000) Creating 100 million data points in the last call took about 10 seconds on a moderately recent workstation, with the RAM usage peaking at about 2.25 GB (before the garbage collection kicked in). Let's make sure that it's the size we expect, as follows: In [6]: len(r) Out[6]: 100000000 If we save this to a file, it weighs in at about three-fourths of a gigabyte: In [7]: r.tofile("../data/points.bin") In [8]: ls -alh ../data/points.bin -rw-r--r-- 1 oubiwann staff 763M Mar 20 11:35 points.bin This actually does fit in the memory on a machine with a RAM of 8 GB, but generating much larger files tends to be problematic. We can reuse it multiple times though, to reach a size that is larger than what can fit in the system RAM. Before we do this, let's take a look at what we've got by generating a smooth curve for the probability distribution, as follows: In [9]: x = np.linspace(burr.ppf(0.0001, c, d), burr.ppf(0.9999, c, d), 100) y = burr.pdf(x, c, d) In [10]: (figure, axes) = plt.subplots(figsize=(20, 10)) axes.plot(x, y, linewidth=5, alpha=0.7) axes.hist(r, bins=100, normed=True) plt.show() The following plot is the result of the preceding code: Our plot of the Burr probability distribution function, along with the 100-bin histogram with a sample size of 100 million points, took about 7 seconds to render. This is due to the fact that NumPy handles most of the work, and we only displayed a limited number of visual elements. What would happen if we did try to plot all the 100 million points? This can be checked by the following code: In [11]: (figure, axes) = plt.subplots() axes.plot(r) plt.show() formatters.py:239: FormatterWarning: Exception in image/png formatter: Allocated too many blocks After about 30 seconds of crunching, the preceding error was thrown—the Agg backend (a shared library) simply couldn't handle the number of artists required to render all the points. But for now, this case clarifies the point that we stated a while back—our first plot rendered relatively quickly because we were selective about the data we chose to present, given the large number of points with which we are working. However, let's say we have data from the files that are too large to fit into the memory. What do we do about this? Possible ways to address this include the following: Moving the data out of the memory and into the filesystem Moving the data off the filesystem and into the databases We will explore examples of these in the following section. Big data on the filesystem The first of the two proposed solutions for large datasets involves not burdening the system memory with data, but rather leaving it on the filesystem. There are several ways to accomplish this, but the following two methods in particular are the most common in the world of NumPy and matplotlib: NumPy's memmap function: This function creates memory-mapped files that are useful if you wish to access small segments of large files on the disk without having to read the whole file into the memory. PyTables: This is a package that is used to manage hierarchical datasets. It is built on the top of the HDF5 and NumPy libraries and is designed to efficiently and easily cope with extremely large amounts of data. We will examine each in turn. NumPy's memmap function Let's restart the IPython kernel by going to the IPython menu at the top of notebook page, selecting Kernel, and then clicking on Restart. When the dialog box pops up, click on Restart. Then, re-execute the first few lines of the notebook by importing the required libraries and getting our style sheet set up. Once the kernel is restarted, take a look at the RAM utilization on your system for a fresh Python process for the notebook: In [4]: Image("memory-before.png") Out[4]: The following screenshot shows the RAM utilization for a fresh Python process: Now, let's load the array data that we previously saved to disk and recheck the memory utilization, as follows: In [5]: data = np.fromfile("../data/points.bin") data_shape = data.shape data_len = len(data) data_len Out[5]: 100000000 In [6]: Image("memory-after.png") Out[6]: The following screenshot shows the memory utilization after loading the array data: This took about five seconds to load, with the memory consumption equivalent to the file size of the data. This means that if we wanted to build some sample data that was too large to fit in the memory, we'd need about 11 of those files concatenated, as follows: In [7]: 8 * 1024 Out[7]: 8192 In [8]: filesize = 763 8192 / filesize Out[8]: 10.73656618610747 However, this is only if the entire memory was available. Let's see how much memory is available right now, as follows: In [9]: del data In [10]: psutil.virtual_memory().available / 1024**2 Out[10]: 2449.1796875 That's 2.5 GB. So, to overrun our RAM, we'll just need a fraction of the total. This is done in the following way: In [11]: 2449 / filesize Out[11]: 3.2096985583224114 The preceding output means that we only need four of our original files to create a file that won't fit in memory. However, in the following section, we will still use 11 files to ensure that data, if loaded into the memory, will be much larger than the memory. How do we create this large file for demonstration purposes (knowing that in a real-life situation, the data would already be created and potentially quite large)? We can try to use numpy.tile to create a file of the desired size (larger than memory), but this can make our system unusable for a significant period of time. Instead, let's use numpy.memmap, which will treat a file on the disk as an array, thus letting us work with data that is too large to fit into the memory. Let's load the data file again, but this time as a memory-mapped array, as follows: In [12]: data = np.memmap( "../data/points.bin", mode="r", shape=data_shape) The loading of the array to a memmap object was very quick (compared to the process of bringing the contents of the file into the memory), taking less than a second to complete. Now, let's create a new file to write the data to. This file must be larger in size as compared to our total system memory (if held on in-memory database, it will be smaller on the disk): In [13]: big_data_shape = (data_len * 11,) big_data = np.memmap( "../data/many-points.bin", dtype="uint8", mode="w+", shape=big_data_shape) The preceding code creates a 1 GB file, which is mapped to an array that has the shape we requested and just contains zeros: In [14]: ls -alh ../data/many-points.bin -rw-r--r-- 1 oubiwann staff 1.0G Apr 2 11:35 many-points.bin In [15]: big_data.shape Out[15]: (1100000000,) In [16]: big_data Out[16]: memmap([0, 0, 0, ..., 0, 0, 0], dtype=uint8) Now, let's fill the empty data structure with copies of the data we saved to the 763 MB file, as follows: In [17]: for x in range(11): start = x * data_len end = (x * data_len) + data_len big_data[start:end] = data big_data Out[17]: memmap([ 90, 71, 15, ..., 33, 244, 63], dtype=uint8) If you check your system memory before and after, you will only see minimal changes, which confirms that we are not creating an 8 GB data structure on in-memory. Furthermore, checking your system only takes a few seconds. Now, we can do some sanity checks on the resulting data and ensure that we have what we were trying to get, as follows: In [18]: big_data_len = len(big_data) big_data_len Out[18]: 1100000000 In [19]: data[100000000 – 1] Out[19]: 63 In [20]: big_data[100000000 – 1] Out[20]: 63 Attempting to get the next index from our original dataset will throw an error (as shown in the following code), since it didn't have that index: In [21]: data[100000000] ----------------------------------------------------------- IndexError Traceback (most recent call last) ... IndexError: index 100000000 is out of bounds ... But our new data does have an index, as shown in the following code: In [22]: big_data[100000000 Out[22]: 90 And then some: In [23]: big_data[1100000000 – 1] Out[23]: 63 We can also plot data from a memmaped array without having a significant lag time. However, note that in the following code, we will create a histogram from 1.1 million points of data, so the plotting won't be instantaneous: In [24]: (figure, axes) = plt.subplots(figsize=(20, 10)) axes.hist(big_data, bins=100) plt.show() The following plot is the result of the preceding code: The plotting took about 40 seconds to generate. The odd shape of the histogram is due to the fact that, with our data file-hacking, we have radically changed the nature of our data since we've increased the sample size linearly without regard for the distribution. The purpose of this demonstration wasn't to preserve a sample distribution, but rather to show how one can work with large datasets. What we have seen is not too shabby. Thanks to NumPy, matplotlib can work with data that is too large for memory, even if it is a bit slow iterating over hundreds of millions of data points from the disk. Can matplotlib do better? HDF5 and PyTables A commonly used file format in the scientific computing community is Hierarchical Data Format (HDF). HDF is a set of file formats (namely HDF4 and HDF5) that were originally developed at the National Center for Supercomputing Applications (NCSA), a unit of the University of Illinois at Urbana-Champaign, to store and organize large amounts of numerical data. The NCSA is a great source of technical innovation in the computing industry—a Telnet client, the first graphical web browser, a web server that evolved into the Apache HTTP server, and HDF, which is of particular interest to us, were all developed here. It is a little known fact that NCSA's web browser code was the ancestor to both the Netscape web browser as well as a prototype of Internet Explorer that was provided to Microsoft by a third party. HDF is supported by Python, R, Julia, Java, Octave, IDL, and MATLAB, to name a few. HDF5 offers significant improvements and useful simplifications over HDF4. It uses B-trees to index table objects and, as such, works well for write-once/read-many time series data. Common use cases span fields such as meteorological studies, biosciences, finance, and aviation. The HDF5 files of multiterabyte sizes are common in these applications. Its typically constructed from the analyses of multiple HDF5 source files, thus providing a single (and often extensive) source of grouped data for a particular application. The PyTables library is built on the top of the Python HDF5 library and NumPy. As such, it not only provides access to one of the most widely used large data file formats in the scientific computing community, but also links data extracted from these files with the data types and objects provided by the fast Python numerical processing library. PyTables is also used in other projects. Pandas wraps PyTables, thus extending its convenient in-memory data structures, functions, and objects to large on-disk files. To use HDF data with Pandas, you'll want to create pandas.HDFStore, read from the HDF data sources with pandas.read_hdf, or write to one with pandas.to_hdf. Files that are too large to fit in the memory may be read and written by utilizing chunking techniques. Pandas does support the disk-based DataFrame operations, but these are not very efficient due to the required assembly on columns of data upon reading back into the memory. One project to keep an eye on under the PyData umbrella of projects is Blaze. It's an open wrapper and a utility framework that can be used when you wish to work with large datasets and generalize actions such as the creation, access, updates, and migration. Blaze supports not only HDF, but also SQL, CSV, and JSON. The API usage between Pandas and Blaze is very similar, and it offers a nice tool for developers who need to support multiple backends. In the following example, we will use PyTables directly to create an HDF5 file that is too large to fit in the memory (for an 8GB RAM machine). We will follow the following steps: Create a series of CSV source data files that take up approximately 14 GB of disk space Create an empty HDF5 file Create a table in the HDF5 file and provide the schema metadata and compression options Load the CSV source data into the HDF5 table Query the new data source once the data has been migrated Remember the temperature precipitation data for St. Francis, in Kansas, USA, from a previous notebook? We are going to generate random data with similar columns for the purposes of the HDF5 example. This data will be generated from a normal distribution, which will be used in the guise of the temperature and precipitation data for hundreds of thousands of fictitious towns across the globe for the last century, as follows: In [25]: head = "country,town,year,month,precip,tempn" row = "{},{},{},{},{},{}n" filename = "../data/{}.csv" town_count = 1000 (start_year, end_year) = (1894, 2014) (start_month, end_month) = (1, 13) sample_size = (1 + 2 * town_count * (end_year – start_year) * (end_month - start_month)) countries = range(200) towns = range(town_count) years = range(start_year, end_year) months = range(start_month, end_month) for country in countries: with open(filename.format(country), "w") as csvfile: csvfile.write(head) csvdata = "" weather_data = norm.rvs(size=sample_size) weather_index = 0 for town in towns: for year in years: for month in months: csvdata += row.format( country, town, year, month, weather_data[weather_index], weather_data[weather_index + 1]) weather_index += 2 csvfile.write(csvdata) Note that we generated a sample data population that was twice as large as the expected size in order to pull both the simulated temperature and precipitation data at the same time (from the same set). This will take about 30 minutes to run. When complete, you will see the following files: In [26]: ls -rtm ../data/*.csv ../data/0.csv, ../data/1.csv, ../data/2.csv, ../data/3.csv, ../data/4.csv, ../data/5.csv, ... ../data/194.csv, ../data/195.csv, ../data/196.csv, ../data/197.csv, ../data/198.csv, ../data/199.csv A quick look at just one of the files reveals the size of each, as follows: In [27]: ls -lh ../data/0.csv -rw-r--r-- 1 oubiwann staff 72M Mar 21 19:02 ../data/0.csv With each file that is 72 MB in size, we have data that takes up 14 GB of disk space, which exceeds the size of the RAM of the system in question. Furthermore, running queries against so much data in the .csv files isn't going to be very efficient. It's going to take a long time. So what are our options? Well, to read this data, HDF5 is a very good fit. In fact, it is designed for jobs like this. We will use PyTables to convert the .csv files to a single HDF5. We'll start by creating an empty table file, as follows: In [28]: tb_name = "../data/weather.h5t" h5 = tb.open_file(tb_name, "w") h5 Out[28]: File(filename=../data/weather.h5t, title='', mode='w', root_uep='/', filters=Filters( complevel=0, shuffle=False, fletcher32=False, least_significant_digit=None)) / (RootGroup) '' Next, we'll provide some assistance to PyTables by indicating the data types of each column in our table, as follows: In [29]: data_types = np.dtype( [("country", "<i8"), ("town", "<i8"), ("year", "<i8"), ("month", "<i8"), ("precip", "<f8"), ("temp", "<f8")]) Also, let's define a compression filter that can be used by PyTables when saving our data, as follows: In [30]: filters = tb.Filters(complevel=5, complib='blosc') Now, we can create a table inside our new HDF5 file, as follows: In [31]: tab = h5.create_table( "/", "weather", description=data_types, filters=filters) With that done, let's load each CSV file, read it in chunks so that we don't overload the memory, and then append it to our new HDF5 table, as follows: In [32]: for filename in glob.glob("../data/*.csv"): it = pd.read_csv(filename, iterator=True, chunksize=10000) for chunk in it: tab.append(chunk.to_records(index=False)) tab.flush() Depending on your machine, the entire process of loading the CSV file, reading it in chunks, and appending to a new HDF5 table can take anywhere from 5 to 10 minutes. However, what started out as a collection of the .csv files that weigh in at 14 GB is now a single compressed 4.8 GB HDF5 file, as shown in the following code: In [33]: h5.get_filesize() Out[33]: 4758762819 Here's the metadata for the PyTables-wrapped HDF5 table after the data insertion: In [34]: tab Out[34]: /weather (Table(288000000,), shuffle, blosc(5)) '' description := { "country": Int64Col(shape=(), dflt=0, pos=0), "town": Int64Col(shape=(), dflt=0, pos=1), "year": Int64Col(shape=(), dflt=0, pos=2), "month": Int64Col(shape=(), dflt=0, pos=3), "precip": Float64Col(shape=(), dflt=0.0, pos=4), "temp": Float64Col(shape=(), dflt=0.0, pos=5)} byteorder := 'little' chunkshape := (1365,) Now that we've created our file, let's use it. Let's excerpt a few lines with an array slice, as follows: In [35]: tab[100000:100010] Out[35]: array([(0, 69, 1947, 5, -0.2328834718674, 0.06810312195695), (0, 69, 1947, 6, 0.4724989007889, 1.9529216219569), (0, 69, 1947, 7, -1.0757216683235, 1.0415374480545), (0, 69, 1947, 8, -1.3700249968748, 3.0971874991576), (0, 69, 1947, 9, 0.27279758311253, 0.8263207523831), (0, 69, 1947, 10, -0.0475253104621, 1.4530808932953), (0, 69, 1947, 11, -0.7555493935762, -1.2665440609117), (0, 69, 1947, 12, 1.540049376928, 1.2338186532516), (0, 69, 1948, 1, 0.829743501445, -0.1562732708511), (0, 69, 1948, 2, 0.06924900463163, 1.187193711598)], dtype=[('country', '<i8'), ('town', '<i8'), ('year', '<i8'), ('month', '<i8'), ('precip', '<f8'), ('temp', '<f8')]) In [36]: tab[100000:100010]["precip"] Out[36]: array([-0.23288347, 0.4724989 , -1.07572167, -1.370025 , 0.27279758, -0.04752531, -0.75554939, 1.54004938, 0.8297435 , 0.069249 ]) When we're done with the file, we do the same thing that we would do with any other file-like object: In [37]: h5.close() If we want to work with it again, simply load it, as follows: In [38]: h5 = tb.open_file(tb_name, "r") tab = h5.root.weather Let's try plotting the data from our HDF5 file: In [39]: (figure, axes) = plt.subplots(figsize=(20, 10)) axes.hist(tab[:1000000]["temp"], bins=100) plt.show() Here's a plot for the first million data points: This histogram was rendered quickly, with a much better response time than what we've seen before. Hence, the process of accessing the HDF5 data is very fast. The next question might be "What about executing calculations against this data?" Unfortunately, running the following will consume an enormous amount of RAM: tab[:]["temp"].mean() We've just asked for all of the data—all of its 288 million rows. We are going to end up loading everything into the RAM, grinding the average workstation to a halt. Ideally though, when you iterate through the source data and create the HDF5 file, you also crunch the numbers that you will need, adding supplemental columns or groups to the HDF5 file that can be used later by you and your peers. If we have data that we will mostly be selecting (extracting portions) and which has already been crunched and grouped as needed, HDF5 is a very good fit. This is why one of the most common use cases that you see for HDF5 is the sharing and distribution of the processed data. However, if we have data that we need to process repeatedly, then we will either need to use another method besides the one that will cause all the data to be loaded into the memory, or find a better match for our data processing needs. We saw in the previous section that the selection of data is very fast in HDF5. What about calculating the mean for a small section of data? If we've got a total of 288 million rows, let's select a divisor of the number that gives us several hundred thousand rows at a time—2,81,250 rows, to be more precise. Let's get the mean for the first slice, as follows: In [40]: tab[0:281250]["temp"].mean() Out[40]: 0.0030696632864265312 This took about 1 second to calculate. What about iterating through the records in a similar fashion? Let's break up the 288 million records into chunks of the same size; this will result in 1024 chunks. We'll start by getting the ranges needed for an increment of 281,250 and then, we'll examine the first and the last row as a sanity check, as follows: In [41]: limit = 281250 ranges = [(x * limit, x * limit + limit) for x in range(2 ** 10)] (ranges[0], ranges[-1]) Out[41]: ((0, 281250), (287718750, 288000000)) Now, we can use these ranges to generate the mean for each chunk of 281,250 rows of temperature data and print the total number of means that we generated to make sure that we're getting our counts right, as follows: In [42]: means = [tab[x * limit:x * limit + limit]["temp"].mean() for x in range(2 ** 10)] len(means) Out[42]: 1024 Depending on your machine, that should take between 30 and 60 seconds. With this work done, it's now easy to calculate the mean for all of the 288 million points of temperature data: In [43]: sum(means) / len(means) Out[43]: -5.3051780413782918e-05 Through HDF5's efficient file format and implementation, combined with the splitting of our operations into tasks that would not copy the HDF5 data into memory, we were able to perform calculations across a significant fraction of a billion records in less than a minute. HDF5 even supports parallelization. So, this can be improved upon with a little more time and effort. However, there are many cases where HDF5 is not a practical choice. You may have some free-form data, and preprocessing it will be too expensive. Alternatively, the datasets may be actually too large to fit on a single machine. This is when you may consider using matplotlib with distributed data. Summary In this article, we covered the role of NumPy in the world of big data and matplotlib as well as the process and problems in working with large data sources. Also, we discussed the possible solutions to these problems using NumPy's memmap function and HDF5 and PyTables. Resources for Article: Further resources on this subject: First Steps [article] Introducing Interactive Plotting [article] The plot function [article]

0
0
5127

Packt

07 Jul 2015

9 min read

Transactions in Redis

Packt

07 Jul 2015

9 min read

0
1
4199

Packt

06 Jul 2015

8 min read

CoreOS – Overview and Installation

Packt

06 Jul 2015

8 min read

In this article by Rimantas Mocevicius, author of the book CoreOS Essentials, has described CoreOS is often as Linux for massive server deployments, but it can also run easily as a single host on bare-metal, cloud servers, and as a virtual machine on your computer as well. It is designed to run application containers as docker and rkt, and you will learn about its main features later in this article. This article is a practical, example-driven guide to help you learn about the essentials of the CoreOS Linux operating system. We assume that you have experience with VirtualBox, Vagrant, Git, Bash shell scripting and the command line (terminal on UNIX-like computers), and you have already installed VirtualBox, Vagrant, and git on your Mac OS X or Linux computer. As for a cloud installation, we will use Google Cloud's Compute Engine instances. By the end of this article, you will hopefully be familiar with setting up CoreOS on your laptop or desktop, and on the cloud. You will learn how to set up a local computer development machine and a cluster on a local computer and in the cloud. Also, we will cover etcd, systemd, fleet, cluster management, deployment setup, and production clusters. In this article, you will learn how CoreOS works and how to carry out a basic CoreOS installation on your laptop or desktop with the help of VirtualBox and Vagrant. We will basically cover two topics in this article: An overview of CoreOS Installing the CoreOS virtual machine (For more resources related to this topic, see here.) An overview of CoreOS CoreOS is a minimal Linux operation system built to run docker and rkt containers (application containers). By default, it is designed to build powerful and easily manageable server clusters. It provides automatic, very reliable, and stable updates to all machines, which takes away a big maintenance headache from sysadmins. And, by running everything in application containers, such setup allows you to very easily scale servers and applications, replace faulty servers in a fraction of a second, and so on. How CoreOS works CoreOS has no package manager, so everything needs to be installed and used via docker containers. Moreover, it is 40 percent more efficient in RAM usage than an average Linux installation, as shown in this diagram: CoreOS utilizes an active/passive dual-partition scheme to update itself as a single unit, instead of using a package-by-package method. Its root partition is read-only and changes only when an update is applied. If the update is unsuccessful during reboot time, then it rolls back to the previous boot partition. The following image shows OS updated gets applied to partition B (passive) and after reboot it becomes the active to boot from. The docker and rkt containers run as applications on CoreOS. Containers can provide very good flexibility for application packaging and can start very quickly—in a matter of milliseconds. The following image shows the simplicity of CoreOS. Bottom part is Linux OS, the second level is etcd/fleet with docker daemon and the top level are running containers on the server. By default, CoreOS is designed to work in a clustered form, but it also works very well as a single host. It is very easy to control and run application containers across cluster machines with fleet and use the etcd service discovery to connect them as it shown in the following image. CoreOS can be deployed easily on all major cloud providers, for example, Google Cloud, Amazon Web Services, Digital Ocean, and so on. It runs very well on bare-metal servers as well. Moreover, it can be easily installed on a laptop or desktop with Linux, Mac OS X, or Windows via Vagrant, with VirtualBox or VMware virtual machine support. This short overview should throw some light on what CoreOS is about and what it can do. Let's now move on to the real stuff and install CoreOS on to our laptop or desktop machine. Installing the CoreOS virtual machine To use the CoreOS virtual machine, you need to have VirtualBox, Vagrant, and git installed on your computer. In the following examples, we will install CoreOS on our local computer, which will serve as a virtual machine on VirtualBox. Okay, let's get started! Cloning the coreos-vagrant GitHub project Let‘s clone this project and get it running. In your terminal (from now on, we will use just the terminal phrase and use $ to label the terminal prompt), type the following command: $ git clone https://github.com/coreos/coreos-vagrant/ This will clone from the GitHub repository to the coreos-vagrant folder on your computer. Working with cloud-config To start even a single host, we need to provide some config parameters in the cloud-config format via the user data file. In your terminal, type this: $ cd coreos-vagrant$ mv user-data.sample user-data The user data should have content like this (the coreos-vagrant Github repository is constantly changing, so you might see a bit of different content when you clone the repository): #cloud-config coreos: etcd2: #generate a new token for each unique cluster from “ https://discovery.etcd.io/new #discovery: https://discovery.etcd.io/<token> # multi-region and multi-cloud deployments need to use “ $public_ipv4 advertise-client-urls: http://$public_ipv4:2379 initial-advertise-peer-urls: http://$private_ipv4:2380 # listen on both the official ports and the legacy ports # legacy ports can be omitted if your application doesn‘t “ depend on them listen-client-urls: http://0.0.0.0:2379,http://0.0.0.0:4001 listen-peer-urls: “ http://$private_ipv4:2380,http://$private_ipv4:7001 fleet: public-ip: $public_ipv4 flannel: interface: $public_ipv4 units: - name: etcd2.service command: start - name: fleet.service command: start - name: docker-tcp.socket command: start enable: true content: | [Unit] Description=Docker Socket for the API [Socket] ListenStream=2375 Service=docker.service BindIPv6Only=both [Install] WantedBy=sockets.target Replace the text between the etcd2: and fleet: lines to look this: etcd2: name: core-01 initial-advertise-peer-urls: http://$private_ipv4:2380 listen-peer-urls: “ http://$private_ipv4:2380,http://$private_ipv4:7001 initial-cluster-token: core-01_etcd initial-cluster: core-01=http://$private_ipv4:2380 initial-cluster-state: new advertise-client-urls: “ http://$public_ipv4:2379,http://$public_ipv4:4001 listen-client-urls: http://0.0.0.0:2379,http://0.0.0.0:4001 fleet: You can also download the latest user-data file from https://github.com/rimusz/coreos-essentials-book/blob/master/Chapter1/user-data. This should be enough to bootstrap a single-host CoreOS VM with etcd, fleet, and docker running there. Startup and SSH It's now time to boot our CoreOS VM and log in to its console using ssh. Let's boot our first CoreOS VM host. To do so, using the terminal, type the following command: $ vagrant up This will trigger vagrant to download the latest CoreOS alpha (this is the default channel set in the config.rb file, and it can easily be changed to beta, or stable) channel image and the lunch VM instance. You should see something like this as the output in your terminal: CoreOS VM has booted up, so let's open the ssh connection to our new VM using the following command: $ vagrant ssh It should show something like this: CoreOS alpha (some version) core@core-01 ~ $ Perfect! Let's verify that etcd, fleet, and docker are running there. Here are the commands required and the corresponding screenshots of the output: $ systemctl status etcd2 To check the status of fleet, type this: $ systemctl status fleet To check the status of docker, type the following command: $ docker version Lovely! Everything looks fine. Thus, we've got our first CoreOS VM up and running in VirtualBox. Summary In this article, we saw what CoreOS is and how it is installed. We covered a simple CoreOS installation on a local computer with the help of Vagrant and VirtualBox, and checked whether etcd, fleet, and docker are running there. Resources for Article: Further resources on this subject: Core Data iOS: Designing a Data Model and Building Data Objects [article] Clustering [article] Deploying a Play application on CoreOS and Docker [article]

0
0
1771

article-image-introduction-ggplot2-and-plotting-environments-r

Packt

25 Jun 2015

15 min read

Introduction to ggplot2 and the plotting environments in R

Packt

25 Jun 2015

15 min read

In this article by Donato Teutonico, author of the book ggplot2 Essentials, we are going to explore different plotting environments in R and subsequently learn about the package, ggplot2. R provides a complete series of options available for realizing graphics, which make this software quite advanced concerning data visualization. The core of the graphics visualization in R is within the package grDevices, which provides the basic structure of data plotting, as for instance the colors and fonts used in the plots. Such graphic engine was then used as starting point in the development of more advanced and sophisticated packages for data visualization; the most commonly used being graphics and grid. (For more resources related to this topic, see here.) The graphics package is often referred to as the base or traditional graphics environment, since historically it was already available among the default packages delivered with the base installation of R and it provides functions that allow to the generation of complete plots. The grid package developed by Paul Murrell, on the other side, provides an alternative set of graphics tools. This package does not provide directly functions that generate complete plots, so it is not frequently used directly for generating graphics, but it was used in the development of advanced data visualization packages. Among the grid-based packages, the most widely used are lattice and ggplot2, although they are built implementing different visualization approaches. In fact lattice was build implementing the Trellis plots, while ggplot2 was build implementing the grammar of graphics. A diagram representing the connections between the tools just mentioned is represented in the Figure 1. Figure 1: Overview of the most widely used R packages for graphics Just keep in mind that this is not a complete overview of the packages available, but simply a small snapshot on the main packages used for data visualization in R, since many other packages are built on top of the tools just mentioned. If you would like to get a more complete overview of the graphics tools available in R, you may have a look at the web page of the R project summarizing such tools, http://cran.r-project.org/web/views/Graphics.html. ggplot2 and the Grammar of Graphics The package ggplot2 was developed by Hadley Wickham by implementing a completely different approach to statistical plots. As in the case of lattice, this package is also based on grid, providing a series of high-level functions which allow the creation of complete plots. The ggplot2 package provides an interpretation and extension of the principles of the book The Grammar of Graphics by Leland Wilkinson. Briefly, the Grammar of Graphics assumes that a statistical graphic is a mapping of data to aesthetic attributes and geometric objects used to represent the data, like points, lines, bars, and so on. Together with the aesthetic attributes, the plot can also contain statistical transformation or grouping of the data. As in lattice, also in ggplot2 we have the possibility to split data by a certain variable obtaining a representation of each subset of data in an independent sub-plot; such representation in ggplot2 is called faceting. In a more formal way, the main components of the grammar of graphics are: the data and their mapping, the aesthetic, the geometric objects, the statistical transformations, scales, coordinates and faceting. A more detailed description of these elements is provided along the book ggplot2 Essentials, but this is a summary of the general principles The data that must be visualized are mapped to aesthetic attributes which define how the data should be perceived The geometric objects describe what is actually represented on the plot like lines, points, or bars; the geometric objects basically define which kind of plot you are going to draw The statistical transformations are transformations which are applied to the data to group them; an example of statistical transformations would be, for instance, the smooth line or the regression lines of the previous examples or the binning of the histograms. Scales represent the connection between the aesthetic spaces with the actual values which should be represented. Scales maybe also be used to draw legends The coordinates represent the coordinate system in which the data are drawn The faceting, which we have already mentioned, is a grouping of data in subsets defined by a value of one variable In ggplot2 there are two main high-level functions, capable of creating directly creating a plot, qplot() and ggplot(); qplot() stands for quick plot and it is a simple function with serve a similar purpose to the plot() function in graphics. The function ggplot() on the other side is a much more advanced function which allow the user to have a deep control of the plot layout and details. In this article we will see some examples of qplot() in order to provide you with a taste of the typical plots which can be realized with ggplot2, but for more advanced data visualization the function ggplot(), is much more flexible. If you have a look on the different forums of R programming, there is quite some discussion about which of these two functions would be more convenient to use. My general recommendation would be that it depends on the type of graph you are drawing more frequently. For simple and standard plot, where basically only the data should be represented and some minor modification of standard layout, the qplot() function will do the job. On the other side, if you would need to apply particular transformations to the data or simply if you would like to keep the freedom of controlling and defining the different details of the plot layout, I would recommend to focus in learning the code of ggplot(). In the code below you will see an example of plot realized with ggplot2 where you can identify some of the components of the grammar of graphics. The example is realized with the function ggplot() which allow a more direct comparison with the grammar, but just below you may also find the corresponding code for the use of qplot(). Both codes generate the graph depicted on Figure 2. require(ggplot2) ## Load ggplot2 data(Orange) # Load the data ggplot(data=Orange, ## Data used aes(x=circumference,y=age, color=Tree))+ ##mapping to aesthetic geom_point()+ ##Add geometry (plot with data points) stat_smooth(method="lm",se=FALSE) ##Add statistics(linear regression) ### Corresponding code with qplot() qplot(circumference,age,data=Orange, ## Data used color=Tree, ## Aestetic mapping geom=c("point","smooth"),method="lm",se=FALSE) This simple example can give you an idea of the role of each portion of code in a ggplot2 graph; you have seen how the main function body create the connection between the data and the aesthetic we are interested to represent and how, on top of this, you add the components of the plot like in this case the geometry element of points and the statistical element of regression. You can also notice how the components which need to be added to the main function call are included using the + sign. One more thing worth to mention at this point, is the if you run just the main body function in the ggplot() function, you will get an error message. This is because this call is not able to generate an actual plot. The step during which the plot is actually created is when you include the geometric attributes, in this case geom_point(). This is perfectly in line with the grammar of graphics, since as we have seen the geometry represent the actual connection between the data and what is represented on the plot. Is in fact at this stage that we specify we are interested in having points representing the data, before that nothing was specified yet about which plot we were interested in drawing. Figure 2: Example of plot of Orange dataset with ggplot2 The qplot() function The qplot (quick plot) function is a basic high level function of ggplot2. The general syntax that you should use with this function is the following qplot(x, y, data, colour, shape, size, facets, geom, stat) where x and y represent the variables to plot (y is optional with a default value NULL) data define the dataset containing the variables colour, shape and size are the aesthetic arguments that can be mapped on additional variables facets define the optional faceting of the plot based on one variable contained in the dataset geom allows you to select the actual visualization of the data, which basically will represent the plot which will be generated. Possible values are point, line or boxplot, but we will see several different examples in the next pages stat define the statistics to be used on the data These options represents the most important options available in qplot(). You may find a descriptions of the other function arguments in the help page of the function accessible with ?qplot, or on the ggplot2 website under the following link http://docs.ggplot2.org/0.9.3/qplot.html. Most of the options just discussed can be applied to different types of plots, since most of the concepts of the grammar of graphics, embedded in the code, may be translated from one plot to the other. For instance, you may use the argument colour to do an aesthetics mapping to one variable; these same concepts can in example be applied to scatterplots as well as histograms. Exactly the same principle would be applied to facets, which can be used for splitting plots independently on the type of plot considered. Histograms and density plots Histograms are plots used to explore how one (or more) quantitative variables are distributed. To show some examples of histograms we will use the iris data. This dataset contains measurements in centimetres of the variables sepal length and width and petal length and width for 50 flowers from each of three species of the flower iris: iris setosa, versicolor, and virginica. You may find more details running ?iris. The geometric attribute used to produce histograms is simply by specifying geom=”histogram” in the qplot() function. This default histogram will represent the variable specified on the x axis while the y axis will represent the number of elements in each bin. One other very useful way of representing distributions is to look at the kernel density function, which will basically produce a sort of continuous histogram instead of different bins by estimating the probability density function. For example let’s plot the petal length of all the three species of iris as histogram and density plot. data(iris) ## Load data qplot(Petal.Length, data=iris, geom="histogram") ## Histogram qplot(Petal.Length, data=iris, geom="density") ## Density plot The output of this code is showed in Figure 3. Figure 3: Histogram (left) and density plot (right) As you can see in both plots of Figure 3, it appears that the data are not distributed homogenously, but there are at least two distinct distribution clearly separated. This is very reasonably due to a different distribution for one of the iris species. To try to verify if the two distributions are indeed related to specie differences, we could generate the same plot using aesthetic attributes and have a different colour for each subtype of iris. To do this, we can simply map the fill to the Species column in the dataset; also in this case we can do that for the histogram and the density plot too. Below you may see the code we built, and in Figure 4 the resulting output. qplot(Petal.Length, data=iris, geom="histogram", colour=Species, fill=Species) qplot(Petal.Length, data=iris, geom="density", colour=Species, fill=Species) Figure 4: Histogram (left) and density plot (right) with aesthetic attribute for colour and fill In the distribution we can see that the lower data are coming from the Setosa species, while the two other distributions are partly overlapping. Scatterplots Scatterplots are probably the most common plot, since they are usually used to display the relationship between two quantitative variables. When two variables are provided, ggplot2 will make a scatterplot by default. For our example on how to build a scatterplot, we will use a dataset called ToothGrowth, which is available in the base R installation. In this dataset are reported measurements of teeth length of 10 guinea pig for three different doses of vitamin C (0.5, 1, and 2 mg) delivered in two different ways, as orange juice or as ascorbic acid (a compound having vitamin C activity). You can find, as usual, details on these data on the dataset help page at ?ToothGrowth. We are interested in seeing how the length of the teeth changed for the different doses. We are not able to distinguish among the different guinea pigs, since this information is not contained in the data, so for the moment we will plot just all the data we have. So let’s load the dataset and do a basic plot of the dose vs. length. require(ggplot2) data(ToothGrowth) qplot(dose, len, data=ToothGrowth, geom="point") ##Alternative coding qplot(dose, len, data=ToothGrowth) The resulting plot is reproduced in Figure 5. As you have seen, the default plot generated, also without a geom argument, is the scatter plot, which is the default bivariate plot type. In this plot we may have an idea of the tendency the data have, for instance we see that the teeth length increase by increasing the amount of vitamin C intake. On the other side, we know that there are two different subgroups in our data, since the vitamin C was provided in two different ways, as orange juice or as ascorbic acid, so it could be interesting to check if these two groups behave differently. Figure 5: Scatterplot of length vs. dose of ToothGrowth data The first approach could be to have the data in two different colours. To do that we simply need to assign the colour attribute to the column sup in the data, which defines the way of vitamin intake. The resulting plot is in Figure 6. qplot(dose, len,data=ToothGrowth, geom="point", col=supp) We now can distinguish from which intake route come each data in the plot and it looks like the data from orange juice shown are a little higher compared to ascorbic acid, but to differentiate between them it is not really easy. We could then try with the facets, so that the data will be completely separated in two different sub-plots. So let´s see what happens. Figure 6: Scatterplot of length vs. dose of ToothGrowth with data in different colours depending on vitamin intake. qplot(dose, len,data=ToothGrowth, geom="point", facets=.~supp) In this new plot, showed in Figure 7, we definitely have a better picture of the data, since we can see how the tooth growth differs for the different intakes. As you have seen, in this simple example, you will find that the best visualization may be different depending on the data you have. In some cases grouping a variable with colours or dividing the data with faceting may give you a different idea about the data and their tendency. For instance in our case with the plot in Figure 7 we can see that the way how the tooth growth increase with dose seems to be different for the different intake routes. Figure 7: Scatterplot of length vs. dose of ToothGrowth with faceting One approach to see the general tendency of the data could be to include a smooth line to the graph. In this case in fact we can see that the growth in the case of the orange juice does not looks really linear, so a smooth line could be a nice way to catch this. In order to do that we simply add a smooth curve to the vector of geometry components in the qplot() function. qplot(dose, len,data=ToothGrowth, geom=c("point","smooth"), facets=.~supp) As you can see from the plot obtained (Figure 8) we now see not only clearly the different data thanks to the faceting, but we can also see the tendency of the data with respect to the dose administered. As you have seen, requiring for the smooth line in ggplot2 will also include a confidence interval in the plot. If you would like to not to have the confidence interval you may simply add the argument se=FALSE. Figure 8: Scatterplot of length vs. dose of ToothGrowth with faceting and smooth line Summary In this short article we have seen some basic concept of ggplot2, ranging from the basic principles in comparison with the other R packages for graphics, up to some basic plots as for instance histograms, density plots or scatterplots. In this case we have limited our example to the use of qplot(), which enable us to obtain plots with some easy commands, but on the other side, in order to have a full control of plot appearance as well as data representation the function ggplot() will provide you with much more advanced functionalities. You can find a more detailed description of these functions as well as of the different features of ggplot2 together illustrated in various examples in the book ggplot2 Essentials. Resources for Article: Further resources on this subject: Data Analysis Using R [article] Data visualization [article] Using R for Statistics, Research, and Graphics [article]

0
0
10526

article-image-querying-and-filtering-data

Packt

25 Jun 2015

28 min read

Querying and Filtering Data

Packt

25 Jun 2015

28 min read

0
0
10430

article-image-moving-further-numpy-modules

Packt

23 Jun 2015

23 min read

Moving Further with NumPy Modules

Packt

23 Jun 2015

23 min read

0
0
4499

Packt

22 Jun 2015

25 min read

The pandas Data Structures

Packt

22 Jun 2015

25 min read

0
0
4873

article-image-documents-and-collections-data-modeling-mongodb

Packt

22 Jun 2015

12 min read

Documents and Collections in Data Modeling with MongoDB

Packt

22 Jun 2015

12 min read

In this article by Wilson da Rocha França, author of the book, MongoDB Data Modeling, we will cover documents and collections used in data modeling with MongoDB. (For more resources related to this topic, see here.) Data modeling is a very important process during the conception of an application since this step will help you to define the necessary requirements for the database's construction. This definition is precisely the result of the data understanding acquired during the data modeling process. As previously described, this process, regardless of the chosen data model, is commonly divided into two phases: one that is very close to the user's view and the other that is a translation of this view to a conceptual schema. In the scenario of relational database modeling, the main challenge is to build a robust database from these two phases, with the aim of guaranteeing updates to it with any impact during the application's lifecycle. A big advantage of NoSQL compared to relational databases is that NoSQL databases are more flexible at this point, due to the possibility of a schema-less model that, in theory, can cause less impact on the user's view if a modification in the data model is needed. Despite the flexibility NoSQL offers, it is important to previously know how we will use the data in order to model a NoSQL database. It is a good idea not to plan the data format to be persisted, even in a NoSQL database. Moreover, at first sight, this is the point where database administrators, quite used to the relational world, become more uncomfortable. Relational database standards, such as SQL, brought us a sense of security and stability by setting up rules, norms, and criteria. On the other hand, we will dare to state that this security turned database designers distant of the domain from which the data to be stored is drawn. The same thing happened with application developers. There is a notable divergence of interests among them and database administrators, especially regarding data models. The NoSQL databases practically bring the need for an approximation between database professionals and the applications, and also the need for an approximation between developers and databases. For that reason, even though you may be a data modeler/designer or a database administrator, don't be scared if from now on we address subjects that are out of your comfort zone. Be prepared to start using words common from the application developer's point of view, and add them to your vocabulary. This article will cover the following: Introducing your documents and collections The document's characteristics and structure Introducing documents and collections MongoDB has the document as a basic unity of data. The documents in MongoDB are represented in JavaScript Object Notation (JSON). Collections are groups of documents. Making an analogy, a collection is similar to a table in a relational model and a document is a record in this table. And finally, collections belong to a database in MongoDB. The documents are serialized on disk in a format known as Binary JSON (BSON), a binary representation of a JSON document. An example of a document is: { "_id": 123456, "firstName": "John", "lastName": "Clay", "age": 25, "address": { "streetAddress": "131 GEN. Almério de Moura Street", "city": "Rio de Janeiro", "state": "RJ", "postalCode": "20921060" }, "phoneNumber":[ { "type": "home", "number": "+5521 2222-3333" }, { "type": "mobile", "number": "+5521 9888-7777" } ] } Unlike the relational model, where you must declare a table structure, a collection doesn't enforce a certain structure for a document. It is possible that a collection contains documents with completely different structures. We can have, for instance, on the same users collection: { "_id": "123456", "username": "johnclay", "age": 25, "friends":[ {"username": "joelsant"}, {"username": "adilsonbat"} ], "active": true, "gender": "male" } We can also have: { "_id": "654321", "username": "santymonty", "age": 25, "active": true, "gender": "male", "eyeColor": "brown" } In addition to this, another interesting feature of MongoDB is that not just data is represented by documents. Basically, all user interactions with MongoDB are made through documents. Besides data recording, documents are a means to: Define what data can be read, written, and/or updated in queries Define which fields will be updated Create indexes Configure replication Query the information from the database Before we go deep into the technical details of documents, let's explore their structure. JSON JSON is a text format for the open-standard representation of data and that is ideal for data traffic. To explore the JSON format deeper, you can check ECMA-404 The JSON Data Interchange Standard where the JSON format is fully described. JSON is described by two standards: ECMA-404 and RFC 7159. The first one puts more focus on the JSON grammar and syntax, while the second provides semantic and security considerations. As the name suggests, JSON arises from the JavaScript language. It came about as a solution for object state transfers between the web server and the browser. Despite being part of JavaScript, it is possible to find generators and readers for JSON in almost all the most popular programming languages such as C, Java, and Python. The JSON format is also considered highly friendly and human-readable. JSON does not depend on the platform chosen, and its specification are based on two data structures: A set or group of key/value pairs A value ordered list So, in order to clarify any doubts, let's talk about objects. Objects are a non-ordered collection of key/value pairs that are represented by the following pattern: { "key" : "value" } In relation to the value ordered list, a collection is represented as follows: ["value1", "value2", "value3"] In the JSON specification, a value can be: A string delimited with " " A number, with or without a sign, on a decimal base (base 10). This number can have a fractional part, delimited by a period (.), or an exponential part followed by e or E Boolean values (true or false) A null value Another object Another value ordered array The following diagram shows us the JSON value structure: Here is an example of JSON code that describes a person: { "name" : "Han", "lastname" : "Solo", "position" : "Captain of the Millenium Falcon", "species" : "human", "gender":"male", "height" : 1.8 } BSON BSON means Binary JSON, which, in other words, means binary-encoded serialization for JSON documents. If you are seeking more knowledge on BSON, I suggest you take a look at the BSON specification on http://bsonspec.org/. If we compare BSON to the other binary formats, BSON has the advantage of being a model that allows you more flexibility. Also, one of its characteristics is that it's lightweight—a feature that is very important for data transport on the Web. The BSON format was designed to be easily navigable and both encoded and decoded in a very efficient way for most of the programming languages that are based on C. This is the reason why BSON was chosen as the data format for MongoDB disk persistence. The types of data representation in BSON are: String UTF-8 (string) Integer 32-bit (int32) Integer 64-bit (int64) Floating point (double) Document (document) Array (document) Binary data (binary) Boolean false (x00 or byte 0000 0000) Boolean true (x01 or byte 0000 0001) UTC datetime (int64)—the int64 is UTC milliseconds since the Unix epoch Timestamp (int64)—this is the special internal type used by MongoDB replication and sharding; the first 4 bytes are an increment, and the last 4 are a timestamp Null value () Regular expression (cstring) JavaScript code (string) JavaScript code w/scope (code_w_s) Min key()—the special type that compares a lower value than all other possible BSON element values Max key()—the special type that compares a higher value than all other possible BSON element values ObjectId (byte*12) Characteristics of documents Before we go into detail about how we must model documents, we need a better understanding of some of its characteristics. These characteristics can determine your decision about how the document must be modeled. The document size We must keep in mind that the maximum length for a BSON document is 16 MB. According to BSON specifications, this length is ideal for data transfers through the Web and to avoid the excessive use of RAM. But this is only a recommendation. Nowadays, a document can exceed the 16 MB length by using GridFS. GridFS allows us to store documents in MongoDB that are larger than the BSON maximum size, by dividing it into parts, or chunks. Each chunk is a new document with 255 K of size. Names and values for a field in a document There are a few things that you must know about names and values for fields in a document. First of all, any field's name in a document is a string. As usual, we have some restrictions on field names. They are: The _id field is reserved for a primary key You cannot start the name using the character $ The name cannot have a null character, or (.) Additionally, documents that have indexed fields must respect the size limit for an indexed field. The values cannot exceed the maximum size of 1,024 bytes. The document primary key As seen in the preceding section, the _id field is reserved for the primary key. By default, this field must be the first one in the document, even when, during an insertion, it is not the first field to be inserted. In these cases, MongoDB moves it to the first position. Also, by definition, it is in this field that a unique index will be created. The _id field can have any value that is a BSON type, except the array. Moreover, if a document is created without an indication of the _id field, MongoDB will automatically create an _id field of the ObjectId type. However, this is not the only option. You can use any value you want to identify your document as long as it is unique. There is another option, that is, generating an auto-incremental value based on a support collection or on an optimistic loop. Support collections In this method, we use a separate collection that will keep the last used value in the sequence. To increment the sequence, first we should query the last used value. After this, we can use the operator $inc to increment the value. There is a collection called system.js that can keep the JavaScript code in order to reuse it. Be careful not to include application logic in this collection. Let's see an example for this method: db.counters.insert( { _id: "userid", seq: 0 } ) function getNextSequence(name) { var ret = db.counters.findAndModify( { query: { _id: name }, update: { $inc: { seq: 1 } }, new: true } ); return ret.seq; } db.users.insert( { _id: getNextSequence("userid"), name: "Sarah C." } ) The optimistic loop The generation of the _id field by an optimistic loop is done by incrementing each iteration and, after that, attempting to insert it in a new document: function insertDocument(doc, targetCollection) { while (1) { var cursor = targetCollection.find( {}, { _id: 1 } ).sort( { _id: -1 } ).limit(1); var seq = cursor.hasNext() ? cursor.next()._id + 1 : 1; doc._id = seq; var results = targetCollection.insert(doc); if( results.hasWriteError() ) { if( results.writeError.code == 11000 /* dup key */ ) continue; else print( "unexpected error inserting data: " + tojson( results ) ); } break; } } In this function, the iteration does the following: Searches in targetCollection for the maximum value for _id. Settles the next value for _id. Sets the value on the document to be inserted. Inserts the document. In the case of errors due to duplicated _id fields, the loop repeats itself, or else the iteration ends. The points demonstrated here are the basics to understanding all the possibilities and approaches that this tool can offer. But, although we can use auto-incrementing fields for MongoDB, we must avoid using them because this tool does not scale for a huge data mass. Summary In this article, you saw how to build documents in MongoDB, examined their characteristics, and saw how they are organized into collections. Resources for Article: Further resources on this subject: Apache Solr and Big Data – integration with MongoDB [article] About MongoDB [article] Creating a RESTful API [article]

0
0
2397

Packt

16 Jun 2015

8 min read

Clustering

Packt

16 Jun 2015

8 min read

In this article by Jayani Withanawasam, author of the book Apache Mahout Essentials, we will see the clustering technique in machine learning and its implementation using Apache Mahout. The K-Means clustering algorithm is explained in detail with both Java and command-line examples (sequential and parallel executions), and other important clustering algorithms, such as Fuzzy K-Means, canopy clustering, and spectral K-Means are also explored. In this article, we will cover the following topics: Unsupervised learning and clustering Applications of clustering Types of clustering K-Means clustering K-Means clustering with MapReduce (For more resources related to this topic, see here.) Unsupervised learning and clustering Information is a key driver for any type of organization. However, with the rapid growth in the volume of data, valuable information may be hidden and go unnoticed due to the lack of effective data processing and analyzing mechanisms. Clustering is an unsupervised learning mechanism that can find the hidden patterns and structures in data by finding data points that are similar to each other. No prelabeling is required. So, you can organize data using clustering with little or no human intervention. For example, let's say you are given a collection of balls of different sizes without any category labels, such as big and small, attached to them; you should be able to categorize them using clustering by considering their attributes, such as radius and weight, for similarity. We will learn how to use Apache Mahout to perform clustering using different algorithms. Applications of clustering Clustering has many applications in different domains, such as biology, business, and information retrieval. Computer vision and image processing Clustering techniques are widely used in the computer vision and image processing domain. Clustering is used for image segmentation in medical image processing for computer aided disease (CAD) diagnosis. One specific area is breast cancer detection. In breast cancer detection, a mammogram is clustered into several parts for further analysis, as shown in the following image. The regions of interest for signs of breast cancer in the mammogram can be identified using the K-Means algorithm. Image features such as pixels, colors, intensity, and texture are used during clustering: Types of clustering Clustering can be divided into different categories based on different criteria. Hard clustering versus soft clustering Clustering techniques can be divided into hard clustering and soft clustering based on the cluster's membership. In hard clustering, a given data point in n-dimensional space only belongs to one cluster. This is also known as exclusive clustering. The K-Means clustering mechanism is an example of hard clustering. A given data point can belong to more than one cluster in soft clustering. This is also known as overlapping clustering. The Fuzzy K-Means algorithm is a good example of soft clustering. A visual representation of the difference between hard clustering and soft clustering is given in the following figure: Flat clustering versus hierarchical clustering In hierarchical clustering, a hierarchy of clusters is built using the top-down (divisive) or bottom-up (agglomerative) approach. This is more informative and accurate than flat clustering, which is a simple technique where no hierarchy is present. However, this comes at the cost of performance, as flat clustering is faster and more efficient than hierarchical clustering. For example, let's assume that you need to figure out T-shirt sizes for people of different sizes. Using hierarchal clustering, you can come up with sizes for small (s), medium (m), and large (l) first by analyzing a sample of the people in the population. Then, we can further categorize this as extra small (xs), small (s), medium, large (l), and extra large (xl) sizes. Model-based clustering In model-based clustering, data is modeled using a standard statistical model to work with different distributions. The idea is to find a model that best fits the data. The best-fit model is achieved by tuning up parameters to minimize loss on errors. Once the parameter values are set, probability membership can be calculated for new data points using the model. Model-based clustering gives a probability distribution over clusters. K-Means clustering K-Means clustering is a simple and fast clustering algorithm that has been widely adopted in many problem domains. We will give a detailed explanation of the K-Means algorithm, as it will provide the base for other algorithms. K-Means clustering assigns data points to k number of clusters (cluster centroids) by minimizing the distance from the data points to the cluster centroids. Let's consider a simple scenario where we need to cluster people based on their size (height and weight are the selected attributes) and different colors (clusters): We can plot this problem in two-dimensional space, as shown in the following figure and solve it using the K-Means algorithm: Getting your hands dirty! Let's move on to a real implementation of the K-Means algorithm using Apache Mahout. The following are the different ways in which you can run algorithms in Apache Mahout: Sequential MapReduce You can execute the algorithms using a command line (by calling the correct bin/mahout subcommand) or using Java programming (calling the correct driver's run method). Running K-Means using Java programming This example continues with the people-clustering scenario mentioned earlier. The size (weight and height) distribution for this example has been plotted in two-dimensional space, as shown in the following image: Data preparation First, we need to represent the problem domain as numerical vectors. The following table shows the size distribution of people mentioned in the previous scenario: Weight (kg) Height (cm) 22 80 25 75 28 85 55 150 50 145 53 153 Save the following content in a file named KmeansTest.data: 22 80 25 75 28 85 55 150 50 145 53 153 Understanding important parameters Let's take a look at the significance of some important parameters: org.apache.hadoop.fs.Path: This denotes the path to a file or directory in the filesystem. org.apache.hadoop.conf.Configuration: This provides access to Hadoop-related configuration parameters. org.apache.mahout.common.distance.DistanceMeasure: This determines the distance between two points. K: This denotes the number of clusters. convergenceDelta: This is a double value that is used to determine whether the algorithm has converged. maxIterations: This denotes the maximum number of iterations to run. runClustering: If this is true, the clustering step is to be executed after the clusters have been determined. runSequential: If this is true, the K-Means sequential implementation is to be used in order to process the input data. The following code snippet shows the source code: private static final String DIRECTORY_CONTAINING_CONVERTED_INPUT ="Kmeansdata";public static void main(String[] args) throws Exception {// Path to output folderPath output = new Path("Kmeansoutput");// Hadoop configuration detailsConfiguration conf = new Configuration();HadoopUtil.delete(conf, output);run(conf, new Path("KmeansTest"), output, newEuclideanDistanceMeasure(), 2, 0.5, 10);}public static void run(Configuration conf, Path input, Pathoutput, DistanceMeasure measure, int k,double convergenceDelta, int maxIterations) throws Exception {// Input should be given as sequence file formatPath directoryContainingConvertedInput = new Path(output,DIRECTORY_CONTAINING_CONVERTED_INPUT);InputDriver.runJob(input, directoryContainingConvertedInput,"org.apache.mahout.math.RandomAccessSparseVector");// Get initial clusters randomlyPath clusters = new Path(output, "random-seeds");clusters = RandomSeedGenerator.buildRandom(conf,directoryContainingConvertedInput, clusters, k, measure);// Run K-Means with a given KKMeansDriver.run(conf, directoryContainingConvertedInput,clusters, output, convergenceDelta,maxIterations, true, 0.0, false);// run ClusterDumper to display resultPath outGlob = new Path(output, "clusters-*-final");Path clusteredPoints = new Path(output,"clusteredPoints");ClusterDumper clusterDumper = new ClusterDumper(outGlob,clusteredPoints);clusterDumper.printClusters(null);} Use the following code example in order to get a better (readable) outcome to analyze the data points and the centroids they are assigned to: Reader reader = new SequenceFile.Reader(fs,new Path(output,Cluster.CLUSTERED_POINTS_DIR + "/part-m-00000"), conf);IntWritable key = new IntWritable();WeightedPropertyVectorWritable value = newWeightedPropertyVectorWritable();while (reader.next(key, value)) {System.out.println("key: " + key.toString()+ " value: "+value.toString());}reader.close(); After you run the algorithm, you will see the clustering output generated for each iteration and the final result in the filesystem (in the output directory you have specified; in this case, Kmeansoutput). Summary Clustering is an unsupervised learning mechanism that requires minimal human effort. Clustering has many applications in different areas, such as medical image processing, market segmentation, and information retrieval. Clustering mechanisms can be divided into different types, such as hard, soft, flat, hierarchical, and model-based clustering based on different criteria. Apache Mahout implements different clustering algorithms, which can be accessed sequentially or in parallel (using MapReduce). The K-Means algorithm is a simple and fast algorithm that is widely applied. However, there are situations that the K-Means algorithm will not be able to cater to. For such scenarios, Apache Mahout has implemented other algorithms, such as canopy, Fuzzy K-Means, streaming, and spectral clustering. Resources for Article: Further resources on this subject: Apache Solr and Big Data – integration with MongoDB [Article] Introduction to Apache ZooKeeper [Article] Creating an Apache JMeter™ test workbench [Article]

0
0
2887

How-To Tutorials - Data

Understanding Hadoop Backup and Recovery Needs

Bayesian Network Fundamentals

NLTK for hackers

Elasticsearch – Spicing Up a Search Using Geo

Getting Started with Apache Spark

Clustering and Other Unsupervised Learning Methods

Working with large data sources

Transactions in Redis

CoreOS – Overview and Installation

Introduction to ggplot2 and the plotting environments in R

Trending Topics

Querying and Filtering Data

Moving Further with NumPy Modules

The pandas Data Structures

Documents and Collections in Data Modeling with MongoDB

Clustering

Create a Free Account To Continue Reading

Sign in to activate your 7-day free access