
How-To Tutorials - Data

1210 Articles

How to maintain Apache Mesos

Vijin Boricha
13 Feb 2018
6 min read
[box type="note" align="" class="" width=""]This article is an excerpt from a book written by David Blomquist and Tomasz Janiszewski, titled Apache Mesos Cookbook. Throughout the course of the book, you will get to know tips and tricks along with best practices to follow when working with Mesos.[/box] In this article, we will learn about configuring logging options, setting up monitoring ecosystem, and upgrading your Mesos cluster. Logging and debugging Here we will configure logging options that will allow us to debug the state of Mesos. Getting ready We will assume Mesos is available on localhost port 5050. The steps provided here will work for either master or agents. How to do it... When Mesos is installed from pre-built packages, the logs are by default stored in /var/log/mesos/. When installing from a source build, storing logs is disabled by default. To change the log store location, we need to edit /etc/default/mesos and set the LOGS variable to the desired destination. For some reason, mesos-init-wrapper does not transfer the contents of /etc/mesos/log_dir to the --log_dir flag. That's why we need to set the log's destination in the environment variable. Remember that only Mesos logs will be stored there. Logs from third-party applications (for example, ZooKeeper) will still be sent to STDERR. Changing the default logging level can be done in one of two ways: by specifying the -- logging_level flag or by sending a request and changing the logging level at runtime for a specific period of time. For example, to change the logging level to INFO, just put it in the following code: /etc/mesos/logging_level echo INFO > /etc/mesos/logging_level The possible levels are INFO, WARNING, and ERROR. For example, to change the logging level to the most verbose for 15 minutes for debug purposes, we need to send the following request to the logging/toggle endpoint: curl -v -X POST localhost:5050/logging/toggle?level=3&duration=15mins How it works... Mesos uses the Google-glog library for debugging, but third-party dependencies such as ZooKeeper have their own logging solution. All configuration options are backed by glog and apply only to Mesos core code. Monitoring Now, we will set up monitoring for Mesos. Getting ready We must have a running monitoring ecosystem. Metrics storage could be a simple time- series database such as graphite, influxdb, or prometheus. In the following example, we are using graphite and our metrics are published with http://diamond.readthedocs.io/en/latest/. How to do it... Monitoring is enabled by default. Mesos does not provide any way to automatically push metrics to the registry. However, it exposes them as a JSON that can be periodically pulled and saved into the metrics registry:  Install Diamond using following command: pip install diamond  If additional packages are required to install them, run: sudo apt-get install python-pip python-dev build-essential. pip (Pip Installs Packages) is a Python package manager used to install software written in Python. Configure the metrics handler and interval. Open /etc/diamond/diamond.conf and ensure that there is a section for graphite configuration: [handler_graphite] class = handlers.GraphiteHandler host = <graphite.host> port = <graphite.port> Remember to replace graphite.host and graphite.port with real graphite details. Enable the default Mesos Collector. Create configuration files diamond-setup  - C MesosCollector. Check whether the configuration has proper values and edit them if needed. 
The configuration can be found in /etc/diamond/collectors/MesosCollector.conf. On a master, this file should look like this:

enabled = True
host = localhost
port = 5050

On an agent, the port could be different (5051), as follows:

enabled = True
host = localhost
port = 5051

How it works...

Mesos exposes metrics via the HTTP API. Diamond is a small process that periodically pulls the metrics, parses them, and sends them to the metrics registry, in this case Graphite. The default implementation of the Mesos collector does not store all the available metrics, so it's recommended to write a custom handler that collects all the information you are interested in.

See also...

Metrics can be read from the following endpoints:

http://mesos.apache.org/documentation/latest/endpoints/metrics/snapshot/
http://mesos.apache.org/documentation/latest/endpoints/slave/monitor/statistics/
http://mesos.apache.org/documentation/latest/endpoints/slave/state/

Upgrading Mesos

In this recipe, you will learn how to upgrade your Mesos cluster.

How to do it...

The Mesos release cadence is at least one release per quarter. Minor releases are backward compatible, although there could be some small incompatibilities or the dropping of deprecated methods. The recommended method of upgrading is to apply all intermediate versions. For example, to upgrade from 0.27.2 to 1.0.0, we should apply 0.28.0, 0.28.1, 0.28.2, and finally 1.0.0.

If the agent's configuration changes, clearing the metadata directory is required. You can do this with the following command:

rm -rv {MESOS_DIR}/metadata

Here, {MESOS_DIR} should be replaced with the configured Mesos directory. A rolling upgrade is the preferred method of upgrading a cluster, starting with the masters and then the agents. To minimize the impact on running tasks, an agent whose configuration changes (and which will therefore become inaccessible) should be switched to maintenance mode.

How it works...

Configuration changes may require clearing the metadata because the changes may not be backward compatible. For example, when an agent runs with different isolators, it shouldn't attach to processes that are already running without those isolators. The Mesos architecture guarantees that executors that did not attach to the Mesos agent will commit suicide after a configurable amount of time (--executor_registration_timeout).

Maintenance mode allows you to declare the time window during which the agent will be inaccessible. When this occurs, Mesos sends a reverse offer to all the frameworks so that they can drain that particular agent. The frameworks are responsible for shutting down their tasks and spawning them on another agent. Maintenance mode is applied even if a framework does not implement the HTTP API or explicitly declines the offer. Using maintenance mode can prevent restarting tasks multiple times. Consider the following example with five agents and one task, X. We schedule a rolling upgrade of all the agents. Task X is deployed on agent 1. When agent 1 goes down, the task is moved to agent 2, then to 3, and so on. This approach is extremely inefficient because the task is restarted five times, when it only needs to be restarted twice. Maintenance mode enables the framework to optimally schedule the task to run on agent 5 when agent 1 goes down, and then return to agent 1 when agent 5 goes down.

Figure: worst-case scenario of a rolling upgrade without maintenance mode, and the optimal solution of a rolling upgrade with maintenance mode.

We have learnt about running and maintaining Mesos.
To know more about managing containers and understanding the scheduler API, you may check out this book, Apache Mesos Cookbook.


Hypothesis testing with R

Richa Tripathi
13 Feb 2018
8 min read
[box type="note" align="" class="" width=""]This article is an excerpt taken from the book Learning Quantitative Finance with R written by Dr. Param Jeet and Prashant Vats. This book will help you understand the basics of R and how they can be applied in various Quantitative Finance scenarios.[/box] Hypothesis testing is used to reject or retain a hypothesis based upon the measurement of an observed sample. So in today’s tutorial we will discuss how to implement the various scenarios of hypothesis testing in R. Lower tail test of population mean with known variance The null hypothesis is given by where is the hypothesized lower bound of the population mean. Let us assume a scenario where an investor assumes that the mean of daily returns of a stock since inception is greater than $10. The average of 30 days' daily return sample is $9.9. Assume the population standard deviation is 0.011. Can we reject the null hypothesis at .05 significance level? Now let us calculate the test statistics z which can be computed by the following code in R: > xbar= 9.9 > mu0 = 10 > sig = 1.1 > n = 30 > z = (xbar-mu0)/(sig/sqrt(n)) > z Here: xbar: Sample mean mu: Hypothesized value sig: Standard deviation of population n: Sample size z: Test statistics This gives the value of z the test statistics: [1] -0.4979296 Now let us find out the critical value at 0.05 significance level. It can be computed by the following code: > alpha = .05 > z.alpha = qnorm(1-alpha) > -z.alpha This gives the following output: [1] -1.644854 Since the value of the test statistics is greater than the critical value, we fail to reject the null hypothesis claim that the return is greater than $10. In place of using the critical value test, we can use the pnorm function to compute the lower tail of Pvalue test statistics. This can be computed by the following code: > pnorm(z) This gives the following output: [1] 0.3092668 Since the Pvalue is greater than 0.05, we fail to reject the null hypothesis. Upper tail test of population mean with known variance The null hypothesis is given by  where  is the hypothesized upper bound of the population mean. Let us assume a scenario where an investor assumes that the mean of daily returns of a stock since inception is at most $5. The average of 30 days' daily return sample is $5.1. Assume the population standard deviation is 0.25. Can we reject the null hypothesis at .05 significance level? Now let us calculate the test statistics z, which can be computed by the following code in R: > xbar= 5.1 > mu0 = 5 > sig = .25 > n = 30 > z = (xbar-mu0)/(sig/sqrt(n)) > z Here: xbar: Sample mean mu0: Hypothesized value sig: Standard deviation of population n: Sample size z: Test statistics It gives 2.19089 as the value of test statistics. Now let us calculate the critical value at .05 significance level, which is given by the following code: > alpha = .05 > z.alpha = qnorm(1-alpha) > z.alpha This gives 1.644854, which is less than the value computed for the test statistics. Hence we reject the null hypothesis claim. Also, the Pvalue of the test statistics is given as follows: >pnorm(z, lower.tail=FALSE) This gives 0.01422987, which is less than 0.05 and hence we reject the null hypothesis. Two-tailed test of population mean with known variance The null hypothesis is given by  where  is the hypothesized value of the population mean. Let us assume a scenario where the mean of daily returns of a stock last year is $2. The average of 30 days' daily return sample is $1.5 this year. 
Assume the population standard deviation is .2. Can we reject the null hypothesis that there is not much significant difference in returns this year from last year at .05 significance level? Now let us calculate the test statistics z, which can be computed by the following code in R: > xbar= 1.5 > mu0 = 2 > sig = .1 > n = 30 > z = (xbar-mu0)/(sig/sqrt(n)) > z This gives the value of test statistics as -27.38613. Now let us try to find the critical value for comparing the test statistics at .05 significance level. This is given by the following code: >alpha = .05 >z.half.alpha = qnorm(1-alpha/2) >c(-z.half.alpha, z.half.alpha) This gives the value -1.959964, 1.959964. Since the value of test statistics is not between the range (-1.959964, 1.959964), we reject the claim of the null hypothesis that there is not much significant difference in returns this year from last year at .05 significance level. The two-tailed Pvalue statistics is given as follows: >2*pnorm(z) This gives a value less than .05 so we reject the null hypothesis. In all the preceding scenarios, the variance is known for population and we use the normal distribution for hypothesis testing. However, in the next scenarios, we will not be given the variance of the population so we will be using t distribution for testing the hypothesis. Lower tail test of population mean with unknown variance The null hypothesis is given by  where  is the hypothesized lower bound of the population mean. Let us assume a scenario where an investor assumes that the mean of daily returns of a stock since inception is greater than $1. The average of 30 days' daily return sample is $.9. Assume the population standard deviation is 0.01. Can we reject the null hypothesis at .05 significance level? In this scenario, we can compute the test statistics by executing the following code: > xbar= .9 > mu0 = 1 > sig = .1 > n = 30 > t = (xbar-mu0)/(sig/sqrt(n)) > t Here: xbar: Sample mean mu0: Hypothesized value sig: Standard deviation of sample n: Sample size t: Test statistics This gives the value of the test statistics as -5.477226. Now let us compute the critical value at .05 significance level. This is given by the following code: > alpha = .05 > t.alpha = qt(1-alpha, df=n-1) > -t.alpha We get the value as -1.699127. Since the value of the test statistics is less than the critical value, we reject the null hypothesis claim. Now instead of the value of the test statistics, we can use the Pvalue associated with the test statistics, which is given as follows: >pt(t, df=n-1) This results in a value less than .05 so we can reject the null hypothesis claim. Upper tail test of population mean with unknown variance The null hypothesis is given by where  is the hypothesized upper bound of the population mean. Let us assume a scenario where an investor assumes that the mean of daily returns of a stock since inception is at most $3. The average of 30 days' daily return sample is $3.1. Assume the population standard deviation is .2. Can we reject the null hypothesis at .05 significance level? Now let us calculate the test statistics t which can be computed by the following code in R: > xbar= 3.1 > mu0 = 3 > sig = .2 > n = 30 > t = (xbar-mu0)/(sig/sqrt(n)) > t Here: xbar: Sample mean mu0: Hypothesized value sig: Standard deviation of sample n: Sample size t: Test statistics This gives the value 2.738613 of the test statistics. Now let us find the critical value associated with the .05 significance level for the test statistics. 
It is given by the following code: > alpha = .05 > t.alpha = qt(1-alpha, df=n-1) > t.alpha Since the critical value 1.699127 is less than the value of the test statistics, we reject the null hypothesis claim. Also, the value associated with the test statistics is given as follows: >pt(t, df=n-1, lower.tail=FALSE) This is less than .05. Hence the null hypothesis claim gets rejected. Two tailed test of population mean with unknown variance The null hypothesis is given by , where  is the hypothesized value of the population mean. Let us assume a scenario where the mean of daily returns of a stock last year is $2. The average of 30 days' daily return sample is $1.9 this year. Assume the population standard deviation is .1. Can we reject the null hypothesis that there is not much significant difference in returns this year from last year at .05 significance level? Now let us calculate the test statistics t, which can be computed by the following code in R: > xbar= 1.9 > mu0 = 2 > sig = .1 > n = 30 > t = (xbar-mu0)/(sig/sqrt(n)) > t This gives -5.477226 as the value of the test statistics. Now let us try to find the critical value range for comparing, which is given by the following code: > alpha = .05 > t.half.alpha = qt(1-alpha/2, df=n-1) > c(-t.half.alpha, t.half.alpha) This gives the range value (-2.04523, 2.04523). Since this is the value of the test statistics, we reject the claim of the null hypothesis We learned how to practically perform one-tailed/ two-tailed hypothesis testing with known as well as unknown variance using R. If you enjoyed this excerpt, check out the book  Learning Quantitative Finance with R to explore different methods to manage risks and trading using Machine Learning with R.
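As a quick sketch of the same idea (not part of the book excerpt), R's built-in t.test() function performs the unknown-variance tests in a single call when the raw sample, rather than just its summary statistics, is available. The returns below are simulated purely for illustration, so the reported t statistic and p-value will differ from the worked examples above:

> set.seed(123)
> returns = rnorm(30, mean = 1.9, sd = 0.1)   # hypothetical sample of 30 daily returns
> t.test(returns, mu = 2, alternative = "two.sided", conf.level = 0.95)

The output reports the t statistic, degrees of freedom, p-value, and a confidence interval, replacing the manual qt() and pt() steps; setting alternative = "less" or "greater" gives the corresponding one-tailed tests.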


How to build a music recommendation system with the PageRank algorithm

Vijin Boricha
13 Feb 2018
6 min read
[box type="note" align="" class="" width=""]This article is an excerpt from a book Mastering Spark for Data Science written by Andrew Morgan and Antoine Amend. In this book, you will learn about advanced Spark architectures, how to work with geographic data in Spark, and how to tune Spark algorithms to scale them linearly.[/box] In today’s tutorial, we will learn to build a recommender with PageRank algorithm. The PageRank algorithm Instead of recommending a specific song, we will recommend playlists. A playlist would consist of a list of all our songs ranked by relevance, most to least relevant. Let's begin with the assumption that people listen to music in a similar way to how they browse articles on the web, that is, following a logical path from link to link, but occasionally switching direction, or teleporting, and browsing to a totally different website. Continuing with the analogy, while listening to music one can either carry on listening to music of a similar style (and hence follow their most expected journey), or skip to a random song in a totally different genre. It turns out that this is exactly how Google ranks websites by popularity using a PageRank algorithm. For more details on the PageRank algorithm visit: h t t p ://i l p u b s . s t a n f o r d . e d u :8090/422/1/1999- 66. p d f . The popularity of a website is measured by the number of links it points to (and is referred from). In our music use case, the popularity is built as the number hashes a given song shares with all its neighbors. Instead of popularity, we introduce the concept of song commonality. Building a Graph of Frequency Co-occurrence We start by reading our hash values back from Cassandra and re-establishing the list of song IDs for each distinct hash. Once we have this, we can count the number of hashes for each song using a simple reduceByKey function, and because the audio library is relatively small, we collect and broadcast it to our Spark executors: val hashSongsRDD = sc.cassandraTable[HashSongsPair]("gzet", "hashes") val songHashRDD = hashSongsRDD flatMap { hash => hash.songs map { song => ((hash, song), 1) } } val songTfRDD = songHashRDD map { case ((hash,songId),count) => (songId, count) } reduceByKey(_+_) val songTf = sc.broadcast(songTfRDD.collectAsMap()) Next, we build a co-occurrence matrix by getting the cross product of every song sharing a same hash value, and count how many times the same tuple is observed. Finally, we wrap the song IDs and the normalized (using the term frequency we just broadcast) frequency count inside of an Edge class from GraphX: implicit class Crossable[X](xs: Traversable[X]) { def cross[Y](ys: Traversable[Y]) = for { x <- xs; y <- ys } yield (x, y) val crossSongRDD = songHashRDD.keys .groupByKey() .values .flatMap { songIds => songIds cross songIds filter { case (from, to) => from != to }.map(_ -> 1) }.reduceByKey(_+_) .map { case ((from, to), count) => val weight = count.toDouble / songTfB.value.getOrElse(from, 1) Edge(from, to, weight) }.filter { edge => edge.attr > minSimilarityB.value } val graph = Graph.fromEdges(crossSongRDD, 0L) We are only keeping edges with a weight (meaning a hash co-occurrence) greater than a predefined threshold in order to build our hash frequency graph. Running PageRank Contrary to what one would normally expect when running a PageRank, our graph is undirected. It turns out that for our recommender, the lack of direction does not matter, since we are simply trying to find similarities between Led Zeppelin and Spirit. 
A possible way of introducing direction could be to look at the song publishing date. In order to find musical influences, we could certainly introduce a chronology from the oldest to the newest songs, giving directionality to our edges.

In the following pageRank call, we define a probability of 15% to skip, or teleport as it is known, to any random song, but this can obviously be tuned for different needs:

val prGraph = graph.pageRank(0.001, 0.15)

Finally, we extract the page-ranked vertices and save them as a playlist in Cassandra via an RDD of the Song case class:

case class Song(id: Long, name: String, commonality: Double)

val vertices = prGraph
  .vertices
  .mapPartitions { vertices =>
    // songIdsB is a broadcast map of song ID to song name (defined earlier in the book, not shown here)
    val songIds = songIdsB.value
    vertices map { case (songId, pr) =>
      val songName = songIds.get(songId).get
      Song(songId, songName, pr)
    }
  }

vertices.saveAsCassandraTable("gzet", "playlist")

The reader may be pondering the exact purpose of PageRank here, and how it can be used as a recommender. In fact, our use of PageRank means that the highest ranking songs will be the ones that share many frequencies with other songs. This could be due to a common arrangement, key theme, or melody; or maybe because a particular artist was a major influence on a musical trend. However, these songs should be, at least in theory, more popular (by virtue of the fact that they occur more often), meaning that they are more likely to have mass appeal.

On the other end of the spectrum, low ranking songs are ones where we did not find any similarity with anything we know. Either these songs are so avant-garde that no one has explored these musical ideas before, or alternatively they are so bad that no one ever wanted to copy them! Maybe they were even composed by that up-and-coming artist you were listening to in your rebellious teenage years. Either way, the chance of a random user liking these songs is treated as negligible. Surprisingly, whether it is pure coincidence or whether this assumption really makes sense, the lowest ranked song from this particular audio library is Daft Punk's Motherboard; it is a quite original title (a brilliant one, though) with a definite unique sound.

To summarize, we have learnt how to build a complete recommendation system for a song playlist. You can check out the book Mastering Spark for Data Science to deep dive into Spark and deliver other production-grade data science solutions. Read our post on how deep learning is revolutionizing the music industry. And here is how you can analyze big data using the PageRank algorithm.


What Tableau Data Handling Engine has to offer

Amarabha Banerjee
13 Feb 2018
6 min read
[box type="note" align="" class="" width=""]This article is taken from the book Mastering Tableau, written by David Baldwin. This book will equip you with all the information needed to create effective dashboards and data visualization solutions using Tableau.[/box] In today’s tutorial, we shall explore the Tableau data handling engine and a real world example of how to use it. Tableau's data-handling engine is usually not well comprehended by even advanced authors because it's not an overt part of day-to-day activities; however, for the author who wants to truly grasp how to ready data for Tableau, this understanding is indispensable. In this section, we will explore Tableau's data-handling engine and how it enables structured yet organic data mining processes in the enterprise. To begin, let's clarify a term. The phrase Data-Handling Engine (DHE) in this context references how Tableau interfaces with and processes data. This interfacing and processing is comprised of three major parts: Connection, Metadata, and VizQL. Each part is described in detail in the following section. In other publications, Tableau's DHE may be referred to as a metadata model or the Tableau infrastructure. I've elected not to use either term because each is frequently defined differently in different contexts, which can be quite confusing. Tableau's DHE (that is, the engine for interfacing with and processing data) differs from other broadly considered solutions in the marketplace. Legacy business intelligence solutions often start with structuring the data for an entire enterprise. Data sources are identified, connections are established, metadata is defined, a model is created, and more. The upfront challenges this approach presents are obvious: highly skilled professionals, time-intensive rollout, and associated high startup costs. The payoff is a scalable, structured solution with detailed documentation and process control. Many next generation business intelligence platforms claim to minimize or completely do away with the need for structuring data. The upfront challenges are minimized: specialized skillsets are not required and the rollout time and associated startup costs are low. However, the initial honeymoon is short-lived, since the total cost of ownership advances significantly when difficulties are encountered trying to maintain and scale the solution. Tableau's infrastructure represents a hybrid approach, which attempts to combine the advantages of legacy business intelligence solutions with those of next-generation platforms, while minimizing the shortcomings of both. The philosophical underpinnings of Tableau's hybrid approach include the following: Infrastructure present in current systems should be utilized when advantageous Data models should be accessible by Tableau but not required DHE components as represented in Tableau should be easy to modify DHE components should be adjustable by business users The Tableau Data-Handling Engine The preceding diagram shows that the DHE consists of a run time module (VizQL) and two layers of abstraction (Metadata and Connection). Let's begin at the bottom of the graphic by considering the first layer of abstraction, Connection. The most fundamental aspect of the Connection is a path to the data source. The path should include attributes for the database, tables, and views as applicable. The Connection may also include joins, custom SQL, data-source filters, and more. 
In keeping with Tableau's philosophy of being easy to modify and adjustable by business users (see the previous section), each of these aspects of the Connection is easily modifiable. For example, an author may choose to add an additional table to a join or modify a data-source filter. Note that the Connection does not contain any of the actual data. Although an author may choose to create a data extract based on data accessed by the Connection, that extract is separate from the Connection.

The next layer of abstraction is the Metadata. The most fundamental aspect of the Metadata layer is the determination of each field as a measure or a dimension. When connecting to relational data, Tableau makes the measure/dimension determination based on heuristics that consider the data itself as well as the data source's data types. Other aspects of the metadata include aliases, data types, defaults, roles, and more. Additionally, the Metadata layer encompasses author-generated fields such as calculations, sets, groups, hierarchies, bins, and so on. Because the Metadata layer is completely separate from the Connection layer, it can be used with other Connection layers; that is, the same metadata definitions can be used with different data sources.

VizQL is generated when a user places a field on a shelf. The VizQL is then translated into Structured Query Language (SQL), Multidimensional Expressions (MDX), or Tableau Query Language (TQL) and passed to the backend data source via a driver. The following two aspects of the VizQL module are of primary importance:

VizQL allows the author to change field attributions on the fly
VizQL enables table calculations

Let's consider each of these aspects of VizQL via examples.

Changing field attribution example

An analyst is considering infant mortality rates around the world. Using data from http://data.worldbank.org/, they create the following worksheet by placing AVG(Infant Mortality Rate) and Country on the Columns and Rows shelves, respectively. AVG(Infant Mortality Rate) is, of course, treated as a measure in this case.

Next they create a second worksheet to analyze the relationship between Infant Mortality Rate and Health Exp/Capita (that is, health expenditure per capita). In order to accomplish this, they define Infant Mortality Rate as a dimension.

Studying the SQL generated by VizQL to create this visualization is particularly insightful:

SELECT ['World Indicators$'].[Infant Mortality Rate] AS [Infant Mortality Rate],
  AVG(['World Indicators$'].[Health Exp/Capita]) AS [avg:Health Exp/Capita:ok]
FROM [dbo].['World Indicators$'] ['World Indicators$']
GROUP BY ['World Indicators$'].[Infant Mortality Rate]

The GROUP BY clause clearly communicates that Infant Mortality Rate is treated as a dimension. The takeaway is to note that VizQL enabled the analyst to change the field usage from measure to dimension without adjusting the source metadata. This on-the-fly ability enables creative exploration of the data that is not possible with other tools, and avoids lengthy exercises attempting to define all possible uses for each field.

If you liked our article, be sure to check out Mastering Tableau, which consists of more useful data visualization and data analysis techniques.


Getting started with Data Visualization in Tableau

Amarabha Banerjee
13 Feb 2018
5 min read
[box type="note" align="" class="" width=""]This article is an book extract from Mastering Tableau, written by David Baldwin. Tableau has emerged as one of the most popular Business Intelligence solutions in recent times, thanks to its powerful and interactive data visualization capabilities. This book will empower you to become a master in Tableau by exploiting the many new features introduced in Tableau 10.0.[/box] In today’s post, we shall explore data visualization basics with Tableau and explore a real world example using these techniques. Tableau Software has a focused vision resulting in a small product line. The main product (and hence the center of the Tableau universe) is Tableau Desktop. Assuming you are a Tableau author, that's where almost all your time will be spent when working with Tableau. But of course you must be able to connect to data and output the results. Thus, as shown in the following figure, the Tableau universe encompasses data sources, Tableau Desktop, and output channels, which include the Tableau Server family and Tableau Reader: Worksheet and dashboard creation At the heart of Tableau are worksheets and dashboards. Worksheets contain individual visualizations and dashboards contain one or more worksheets. Additionally, worksheets and dashboards may be combined into stories to communicate specific insights to the end user via a presentation environment. Lastly, all worksheets, dashboards, and stories are organized in workbooks that can be accessed via Tableau Desktop, Server, or Reader. In this section, we will look at worksheet and dashboard creation with the intent of not only communicating the basics, but also providing some insight that may prove helpful to even more seasoned Tableau authors. Worksheet creation At the most fundamental level, a visualization in Tableau is created by placing one or more fields on one or more shelves. To state this as a pseudo-equation: Field(s) + shelf(s) = Viz As an example, note that the visualization created in the following screenshot is generated by placing the Sales field on the Text shelf. Although the results are quite simple – a single number – this does qualify as a view. In other words, a Field (Sales) placed on a shelf (Text) has generated a Viz: Exercise – fundamentals of visualizations Let's explore the basics of creating a visualization via an exercise: Navigate to h t t p s ://p u b l i c . t a b l e a u . c o m /p r o f i l e /d a v i d 1. . b a l d w i n #!/ to locate and download the workbook associated with this chapter. In the workbook, find the tab labeled Fundamentals of Visualizations: Locate Region within the Dimensions portion of the Data pane: Drag Region to the Color shelf; that is, Region + Color shelf = what is shown in the following screenshot: Click on the Color shelf and then on Edit Colors… to adjust colors as desired: Next, move Region to the Size, Label/Text, Detail, Columns, and Rows shelves. After placing Region on each shelf, click on the shelf to access additional options. Lastly, choose other fields to drop on various shelves to continue exploring Tableau's behavior. As you continue exploring Tableau's behavior by dragging and dropping different fields onto different shelves, you will notice that Tableau responds with default behaviors. These defaults, however, can be overridden, which we will explore in the following section. Dashboard Creation Although, as stated previously, a dashboard contains one or more worksheets, dashboards are much more than static presentations. 
They are an essential part of Tableau's interactivity. In this section, we will populate a dashboard with worksheets and then deploy actions for interactivity.

Exercise – building a dashboard

1. In the workbook associated with this chapter, navigate to the tab entitled Building a Dashboard.
2. Within the Dashboard pane located on the left-hand portion of the screen, double-click on each of the following worksheets (in the order in which they are listed) to add them to the dashboard:
US Sales
Customer Segment
Scatter Plot
Customers
3. In the lower right-hand corner of the dashboard, click in the blank area below Profit Ratio to select the vertical container. After clicking in the blank area, you should see a blue border around the filter and the legends. This indicates that the vertical container is selected.
4. Select the vertical container handle and drag it to the left-hand side of the Customers worksheet. The gray shading (provided by Tableau when dragging elements such as worksheets and containers onto a dashboard) helpfully communicates where the element will be placed. Take your time and observe carefully when placing an element on a dashboard, or the results may be unexpected.
5. Format the dashboard as desired. The following tips may prove helpful:
  1. Adjust the sizes of the elements on the screen by hovering over the edges between each element and then clicking and dragging as desired.
  2. The Sales and Profit legends can be made into floating elements. Make an element float by right-clicking on the element handle and selecting Floating. (The handle is located immediately above Region, in the upper right-hand corner.)
  3. Create horizontal and vertical containers by dragging those objects from the bottom portion of the Dashboard pane.
  4. Drag the edges of containers to adjust the size of each worksheet.
  5. Display the dashboard title via Dashboard | Show Title….

If you enjoyed our post, be sure to check out Mastering Tableau, which consists of many useful data visualization and data analysis techniques.


Estimating population statistics with Point Estimation

Aaron Lazar
12 Feb 2018
5 min read
[box type="note" align="" class="" width=""]This article is an extract from the book Principles of Data Science, written by Sinan Ozdemir. The book is a great way to get into the field of data science. It takes a unique approach that bridges the gap between mathematics and computer science, taking you through the entire data science pipeline.[/box] In this extract, we’ll learn how to estimate population means, variances and other statistics using the Point Estimation method. For the code samples, we’ve used Python 2.7. A point estimate is an estimate of a population parameter based on sample data. To obtain these estimates, we simply apply the function that we wish to measure for our population to a sample of the data. For example, suppose there is a company of 9,000 employees and we are interested in ascertaining the average length of breaks taken by employees in a single day. As we probably cannot ask every single person, we will take a sample of the 9,000 people and take a mean of the sample. This sample mean will be our point estimate. The following code is broken into three parts: We will use the probability distribution, known as the Poisson distribution, to randomly generate 9,000 answers to the question: for how many minutes in a day do you usually take breaks? This will represent our "population". We will take a sample of 100 employees (using the Python random sample method) and find a point estimate of a mean (called a sample mean). Compare our sample mean (the mean of the sample of 100 employees) to our population mean. Let's take a look at the following code: np.random.seed(1234) long_breaks = stats.poisson.rvs(loc=10, mu=60, size=3000) # represents 3000 people who take about a 60 minute break The long_breaks variable represents 3000 answers to the question: how many minutes on an average do you take breaks for?, and these answers will be on the longer side. Let's see a visualization of this distribution, shown as follows: pd.Series(long_breaks).hist() We see that our average of 60 minutes is to the left of the distribution. Also, because we only sampled 3000 people, our bins are at their highest around 700-800 people. Now, let's model 6000 people who take, on an average, about 15 minutes' worth of breaks. Let's again use the Poisson distribution to simulate 6000 people, as shown: short_breaks = stats.poisson.rvs(loc=10, mu=15, size=6000) # represents 6000 people who take about a 15 minute break pd.Series(short_breaks).hist() Okay, so we have a distribution for the people who take longer breaks and a distribution for the people who take shorter breaks. Again, note how our average break length of 15 minutes falls to the left-hand side of the distribution, and note that the tallest bar is about 1600 people. breaks = np.concatenate((long_breaks, short_breaks)) # put the two arrays together to get our "population" of 9000 people The breaks variable is the amalgamation of all the 9000 employees, both long and short break takers. Let's see the entire distribution of people in a single visualization: pd.Series(breaks).hist() We see how we have two humps. On the left, we have our larger hump of people who take about a 15 minute break, and on the right, we have a smaller hump of people who take longer breaks. Later on, we will investigate this graph further. We can find the total average break length by running the following code: breaks.mean() # 39.99 minutes is our parameter Our average company break length is about 40 minutes. 
Remember that our population is the entire company's employee size of 9,000 people, and our parameter is 40 minutes. In the real world, our goal would be to estimate the population parameter, because we would not have the resources to ask every single employee in a survey their average break length, for many reasons. Instead, we will use a point estimate.

So, to make our point, we want to simulate a world where we ask 100 random people about the length of their breaks. To do this, let's take a random sample of 100 employees out of the 9,000 employees we simulated, as shown:

sample_breaks = np.random.choice(a = breaks, size=100)
# taking a sample of 100 employees

Now, let's take the mean of the sample and subtract it from the population mean and see how far off we were:

breaks.mean() - sample_breaks.mean()
# difference between means is 4.09 minutes, not bad!

This is extremely interesting, because with only about 1% of our population (100 out of 9,000), we were able to get within 4 minutes of our population parameter and obtain a very accurate estimate of our population mean. Not bad!

Here, we calculated a point estimate for the mean, but we can also do this for proportion parameters. By proportion, I am referring to a ratio of two quantitative values. Let's suppose that in a company of 10,000 people, our employees are 20% white, 10% black, 10% Hispanic, 30% Asian, and 30% identify as other. We will take a sample of 1,000 employees and see if their race proportions are similar.

employee_races = ((["white"] * 2000) + (["black"] * 1000) +
                  (["hispanic"] * 1000) + (["asian"] * 3000) +
                  (["other"] * 3000))

employee_races represents our employee population. For example, in our company of 10,000 people, 2,000 people are white (20%) and 3,000 people are Asian (30%). Let's take a random sample of 1,000 people, as shown:

demo_sample = random.sample(employee_races, 1000)   # Sample 1000 values

for race in set(demo_sample):
    print( race + " proportion estimate:" )
    print( demo_sample.count(race)/1000. )

The output obtained would be as follows:

hispanic proportion estimate:
0.103
white proportion estimate:
0.192
other proportion estimate:
0.288
black proportion estimate:
0.1
asian proportion estimate:
0.317

We can see that the race proportion estimates are very close to the underlying population's proportions. For example, we got 10.3% for Hispanic in our sample, and the population proportion for Hispanic was 10%.

To summarize, you should now be familiar with the point estimation method for estimating population means, variances, and other statistics, and with implementing it in Python. If you found our post useful, you can check out Principles of Data Science for more interesting data science tips and techniques.

Neural Network Architectures 101: Understanding Perceptrons

Kunal Chaudhari
12 Feb 2018
9 min read
[box type="note" align="" class="" width=""]This article is an excerpt taken from a book Neural Network Programming with Java Second Edition written by Fabio M. Soares and Alan M. F. Souza. This book is for Java developers who want to master developing smarter applications like weather forecasting, pattern recognition etc using neural networks. [/box] In this article we will discuss about perceptrons along with their features, applications and limitations. Perceptrons are a very popular neural network architecture that implements supervised learning. Projected by Frank Rosenblatt in 1957, it has just one layer of neurons, receiving a set of inputs and producing another set of outputs. This was one of the first representations of neural networks to gain attention, especially because of their simplicity. In our Java implementation, this is illustrated with one neural layer (the output layer). The following code creates a perceptron with three inputs and two outputs, having the linear function at the output layer: int numberOfInputs=3; int numberOfOutputs=2; Linear outputAcFnc = new Linear(1.0); NeuralNet perceptron = new NeuralNet(numberOfInputs,numberOfOutputs,             outputAcFnc); Applications and limitations However, scientists did not take long to conclude that a perceptron neural network could only be applied to simple tasks, according to that simplicity. At that time, neural networks were being used for simple classification problems, but perceptrons usually failed when faced with more complex datasets. Let's illustrate this with a very basic example (an AND function) to understand better this issue. Linear separation The example consists of an AND function that takes two inputs, x1 and x2. That function can be plotted in a two-dimensional chart as follows: And now let's examine how the neural network evolves the training using the perceptron rule, considering a pair of two weights, w1 and w2, initially 0.5, and bias valued 0.5 as well. Assume learning rate η equals 0.2: Epoch x1 x2 w1 w2 b y t E Δw1 Δw2 Δb 1 0 0 0.5 0.5 0.5 0.5 0 -0.5 0 0 -0.1 1 0 1 0.5 0.5 0.4 0.9 0 -0.9 0 -0.18 -0.18 1 1 0 0.5 0.32 0.22 0.72 0 -0.72 -0.144 0 -0.144 1 1 1 0.356 0.32 0.076 0.752 1 0.248 0.0496 0.0496 0.0496 2 0 0 0.406 0.370 0.126 0.126 0 -0.126 0.000 0.000 -0.025 2 0 1 0.406 0.370 0.100 0.470 0 -0.470 0.000 -0.094 -0.094 2 1 0 0.406 0.276 0.006 0.412 0 -0.412 -0.082 0.000 -0.082 2 1 1 0.323 0.276 -0.076 0.523 1 0.477 0.095 0.095 0.095 … … 89 0 0 0.625 0.562 -0.312 -0.312 0 0.312 0 0 0.062 89 0 1 0.625 0.562 -0.25 0.313 0 -0.313 0 -0.063 -0.063 89 1 0 0.625 0.500 -0.312 0.313 0 -0.313 -0.063 0 -0.063 89 1 1 0.562 0.500 -0.375 0.687 1 0.313 0.063 0.063 0.063 After 89 epochs, we find the network to produce values near to the desired output. Since in this example the outputs are binary (zero or one), we can assume that any value produced by the network that is below 0.5 is considered to be 0 and any value above 0.5 is considered to be 1. So, we can draw a function , with the final weights and bias found by the learning algorithm w1=0.562, w2=0.5 and b=-0.375, defining the linear boundary in the chart: This boundary is a definition of all classifications given by the network. You can see that the boundary is linear, given that the function is also linear. Thus, the perceptron network is really suitable for problems whose patterns are linearly separable. The XOR case Now let's analyze the XOR case: We see that in two dimensions, it is impossible to draw a line to separate the two patterns. 
What would happen if we tried to train a single-layer perceptron to learn this function? Suppose we tried; let's see what happens in the following table:

Epoch | x1 | x2 | w1     | w2     | b     | y     | t | E      | Δw1    | Δw2    | Δb
1     | 0  | 0  | 0.5    | 0.5    | 0.5   | 0.5   | 0 | -0.5   | 0      | 0      | -0.1
1     | 0  | 1  | 0.5    | 0.5    | 0.4   | 0.9   | 1 | 0.1    | 0      | 0.02   | 0.02
1     | 1  | 0  | 0.5    | 0.52   | 0.42  | 0.92  | 1 | 0.08   | 0.016  | 0      | 0.016
1     | 1  | 1  | 0.516  | 0.52   | 0.436 | 1.472 | 0 | -1.472 | -0.294 | -0.294 | -0.294
2     | 0  | 0  | 0.222  | 0.226  | 0.142 | 0.142 | 0 | -0.142 | 0.000  | 0.000  | -0.028
2     | 0  | 1  | 0.222  | 0.226  | 0.113 | 0.339 | 1 | 0.661  | 0.000  | 0.132  | 0.132
2     | 1  | 0  | 0.222  | 0.358  | 0.246 | 0.467 | 1 | 0.533  | 0.107  | 0.000  | 0.107
2     | 1  | 1  | 0.328  | 0.358  | 0.352 | 1.038 | 0 | -1.038 | -0.208 | -0.208 | -0.208
…     |    |    |        |        |       |       |   |        |        |        |
127   | 0  | 0  | -0.250 | -0.125 | 0.625 | 0.625 | 0 | -0.625 | 0.000  | 0.000  | -0.125
127   | 0  | 1  | -0.250 | -0.125 | 0.500 | 0.375 | 1 | 0.625  | 0.000  | 0.125  | 0.125
127   | 1  | 0  | -0.250 | 0.000  | 0.625 | 0.375 | 1 | 0.625  | 0.125  | 0.000  | 0.125
127   | 1  | 1  | -0.125 | 0.000  | 0.750 | 0.625 | 0 | -0.625 | -0.125 | -0.125 | -0.125

The perceptron just could not find any pair of weights that would drive the error below 0.625. This can be explained mathematically, as we already perceived from the chart that this function cannot be linearly separated in two dimensions. So what if we added another dimension? Let's see the chart in three dimensions.

In three dimensions, it is possible to draw a plane that separates the patterns, provided that this additional dimension could properly transform the input data. Okay, but now there is an additional problem: how could we derive this additional dimension, since we have only two input variables? One obvious, but also workaround, answer would be adding a third variable as a derivation from the two original ones. With this third variable being a derivation of the original two, our neural network would probably take the following shape:

Okay, now the perceptron has three inputs, one of them being a composition of the others. This also leads to a new question: how should that composition be processed? We can see that this component could act as a neuron, thereby giving the neural network a nested architecture. If so, there would be another new question: how would the weights of this new neuron be trained, since the error is on the output neuron?

Multi-layer perceptrons

As we can see, one simple example in which the patterns are not linearly separable has led us to more and more issues with the perceptron architecture. That need led to the application of multi-layer perceptrons. The fact that the natural neural network is structured in layers as well, and that each layer captures pieces of information from a specific environment, is already established. In artificial neural networks, layers of neurons act in this way, by extracting and abstracting information from data, transforming it into another dimension or shape.

In the XOR example, we found the solution to be the addition of a third component that would make a linear separation possible. But there remained a few questions regarding how that third component would be computed. Now let's consider the same solution as a two-layer perceptron:

Now we have three neurons instead of just one, but in the output layer the information transferred by the previous layer is transformed into another dimension or shape, whereby it would be theoretically possible to establish a linear boundary on those data points. However, the question of finding the weights for the first layer remains unanswered; or can we apply the same training rule to neurons other than the output? We are going to deal with this issue in the Generalized delta rule section.
MLP properties

Multi-layer perceptrons can have any number of layers and also any number of neurons in each layer. The activation functions may be different in each layer. An MLP network is usually composed of at least two layers, one for the output and one hidden layer. There are also some references that consider the input layer to be the nodes that collect input data; in those cases, the MLP is considered to have at least three layers. For the purpose of this article, let's consider the input layer as a special type of layer which has no weights, and as the effective layers, that is, those that can be trained, we'll consider the hidden and output layers.

A hidden layer is called that because it actually hides its outputs from the external world. Hidden layers can be connected in series in any number, thus forming a deep neural network. However, the more layers a neural network has, the slower both training and running will be, and according to mathematical foundations, a neural network with at most one or two hidden layers may learn as well as deep neural networks with dozens of hidden layers. But it depends on several factors.

MLP weights

In an MLP feedforward network, one particular neuron i receives data from a neuron j of the previous layer and forwards its output to a neuron k of the next layer. The mathematical description of a neural network is recursive; for a single hidden layer it takes the form:

yo = fo( bo + Σ(i=1..nh1) wi · fi( bi + Σj wij · xj ) )

Here, yo is the network output (should we have multiple outputs, we can replace yo with Y, representing a vector); fo is the activation function of the output; l is the number of hidden layers; nhi is the number of neurons in the hidden layer i; wi is the weight connecting the ith neuron of the last hidden layer to the output; fi is the activation function of the neuron i; and bi is the bias of the neuron i (with wij being the input weights of hidden neuron i, and bo the bias of the output neuron). It can be seen that this equation gets larger as the number of layers increases: with more hidden layers, each fi(·) term expands into another weighted sum over the previous layer's activations. In the last (innermost) summing operation, there will be the inputs xi.

Recurrent MLP

The neurons of an MLP may feed signals not only to neurons in the next layers (a feedforward network), but also to neurons in the same or previous layers (feedback, or recurrent). This behavior allows the neural network to maintain state on some data sequence, and this feature is especially exploited when dealing with time series or handwriting recognition. Recurrent networks are usually harder to train, and eventually the computer may run out of memory while executing them. In addition, there are recurrent network architectures better than recurrent MLPs, such as Elman, Hopfield, Echo State, and bidirectional RNNs (recurrent neural networks). But we are not going to dive deep into these architectures.

Coding an MLP

Bringing these concepts into the OOP point of view, we can review the classes already designed so far. One can see that the neural network structure is hierarchical: a neural network is composed of layers that are composed of neurons. In the MLP architecture, there are three types of layers: input, hidden, and output. So suppose that, in Java, we would like to define a neural network consisting of three inputs, one output (linear activation function), and one hidden layer (sigmoid function) containing five neurons.
The resulting code would be as follows:

int numberOfInputs = 3;
int numberOfOutputs = 1;
int[] numberOfHiddenNeurons = {5};

Linear outputAcFnc = new Linear(1.0);
Sigmoid hiddenAcFnc = new Sigmoid(1.0);
NeuralNet neuralnet = new NeuralNet(numberOfInputs, numberOfOutputs,
    numberOfHiddenNeurons, hiddenAcFnc, outputAcFnc);

To summarize, we saw how perceptrons can be applied to solve linear separation problems, their limitations in classifying nonlinear data, and how to overcome those limitations with multi-layer perceptrons (MLPs). If you enjoyed this excerpt, check out the book Neural Network Programming with Java, Second Edition for a better understanding of neural networks and how they fit in different real-world projects.


Implementing a simple Time Series Data Analysis in R

Amarabha Banerjee
09 Feb 2018
4 min read
[box type="note" align="" class="" width=""]This article is extracted from the book Machine Learning with R written by Brett Lantz. This book will methodically take you through stages to apply machine learning for data analysis using R.[/box] In this article, we will explore the popular time series analysis method and its practical implementation using R. Introduction When we think about time, we think about years, days, months, hours, minutes, and seconds. Think of any datasets and you will find some attributes which will be in the form of time, especially data related to stock, sales, purchase, profit, and loss. All these have time associated with them. For example, the price of stock in the stock exchange at different points on a given day or month or year. Think of any industry domain, and sales are an important factor; you can see time series in sales, discounts, customers, and so on. Other domains include but are not limited to statistics, economics and budgets, processes and quality control, finance, weather forecasting, or any kind of forecasting, transport, logistics, astronomy, patient study, census analysis, and the list goes on. In simple words, it contains data or observations in time order, spaced at equal intervals. Time series analysis means finding the meaning in the time-related data to predict what will happen next or forecast trends on the basis of observed values. There are many methods to fit the time series, smooth the random variation, and get some insights from the dataset. When you look at time series data you can see the following: Trend: Long term increase or decrease in the observations or data. Pattern: Sudden spike in sales due to christmas or some other festivals, drug consumption increases due to some condition; this type of data has a fixed time duration and can be predicted for future time also. Cycle: Can be thought of as a pattern that is not fixed; it rises and falls without any pattern. Such time series involve a great fluctuation in data. How to do There are many datasets available with R that are of the time series types. Using the command class, one can know if the dataset is time series or not. We will look into the AirPassengers dataset that shows monthly air passengers in thousands from 1949 to 1960. We will also create new time series to represent the data. Perform the following commands in RStudio or R Console: > class(AirPassengers) Output: [1] "ts" > start(AirPassengers) Output: [1] 1949 1 > end(AirPassengers) Output: [1] 1960 12 > summary(AirPassengers) Output: Min. 1st Qu. Median Mean 3rd Qu. Max. 104.0 180.0 265.5 280.3 360.5 622.0 Analyzing Time Series Data [ 89 ] In the next recipe, we will create the time series and print it out. Let's think of the share price of some company in the range of 2,500 to 4,000 from 2011 to be recorded monthly. 
Perform the following coding in R:

> my_vector = sample(2500:4000, 72, replace=T)
> my_series = ts(my_vector, start=c(2011,1), end=c(2016,12), frequency = 12)
> my_series
Output:
     Jan  Feb  Mar  Apr  May  Jun  Jul  Aug  Sep  Oct  Nov  Dec
2011 2888 3894 3675 3113 3421 3870 2644 2677 3392 2847 2543 3147
2012 2973 3538 3632 2695 3475 3971 2695 2963 3217 2836 3525 2895
2013 3984 3811 2902 3602 3812 3631 2625 3887 3601 2581 3645 3324
2014 3830 2821 3794 3942 3504 3526 3932 3246 3787 2894 2800 2732
2015 3326 3659 2993 2765 3881 3983 3813 3172 2667 3517 3445 2805
2016 3668 3948 2779 2881 3285 2733 3203 3329 3854 3285 3800 2563

How it works

In the first recipe, we used the AirPassengers dataset with the class function and saw that it is ts (ts stands for time series). The start and end functions give the starting and ending year of the dataset along with their values. The frequency tells us the interval of observations: 1 means annual, 4 means quarterly, 12 means monthly, and so on. In the next recipe, we want to generate samples between 2,500 and 4,000 to represent the price of a share. Using the sample function, we create a sample; it takes the range as the first argument and the number of samples required as the second argument. The last argument decides whether duplication is allowed in the sample or not. We stored the sample in my_vector. We then create a time series using the ts function. The ts function takes the vector as an argument, followed by start and end to show the period for which the time series is being constructed. The frequency argument specifies the number of observations per unit of time; here it is 12, one observation per month. To summarize, we talked about how R can be utilized to perform time series analysis in different ways. If you would like to learn more useful machine learning techniques in R, be sure to check out Machine Learning with R.
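For readers who prefer Python, the same recipe can be sketched with pandas; this is an illustrative analogue that we are adding, not part of the book's R material:

import numpy as np
import pandas as pd

# 72 random monthly prices between 2,500 and 4,000, like the sample() call above
prices = np.random.randint(2500, 4001, size=72)
# A month-start index running from January 2011 to December 2016
index = pd.date_range(start='2011-01-01', periods=72, freq='MS')
my_series = pd.Series(prices, index=index)

print(my_series.head(12))              # the first year of observations
print(my_series.resample('A').mean())  # yearly averages, similar in spirit to aggregating a ts object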

Explaining Data Exploration in under a minute

Amarabha Banerjee
08 Feb 2018
5 min read
[box type="note" align="" class="" width=""]The article given below is taken from the book Machine Learning with R written by Brett Lantz. This book will help you harness the power of R for statistical computing and data science.[/box] Today, we shall explore different data exploration techniques and a real-world example of using these techniques.

Introduction

Data exploration is a term used for finding insightful information from data. To find insights, various steps such as data munging, data analysis, data modeling, and model evaluation are taken. In any real data exploration project, six steps are commonly involved in the exploration process. They are as follows:

Asking the right questions: Asking the right questions will help in understanding the objective and the target information sought from the data. Questions can be asked such as: What are my expected findings after the exploration is finished? or What kind of information can I extract through the exploration?

Data collection: Once the right questions have been asked, the target of the exploration is clear. Data collected from various sources comes in unorganized and diverse formats. Data may come from sources such as files, databases, the internet, and so on. Data collected in this way is raw data and needs to be processed to extract meaningful information. Most analysis and visualization tools or applications expect data to be in a certain format to generate results, and hence raw data is of little use to them.

Data munging: The raw data collected needs to be converted into the format expected by the tools to be used. In this phase, raw data is passed through various processes such as parsing, sorting, merging, filtering, dealing with missing values, and so on (a short Python sketch of this step appears after the worked example below). The main aim is to transform the raw data into a format that the analysis and visualization tools understand. Once the data is compatible with the tools, they can be used to generate the different results.

Basic exploratory data analysis: Once the data munging is done and the data is formatted for the tools, it can be used to perform data exploration and analysis. Tools provide various methods and techniques to do the same. Most analysis tools allow statistical functions to be performed on the data, and visualization tools help in visualizing the data in different ways. Using basic statistical operations and visualizations, the same data can be understood in a better way.

Advanced exploratory data analysis: Once the basic analysis is done, it is time to look at an advanced stage of analysis. In this stage, various prediction models are formed on the basis of the requirements. Machine learning algorithms are utilized to train the models and generate inferences. Various tuning of the models is also done to ensure their correctness and effectiveness.

Model assessment: When the models are made, they are evaluated to find the best one among them. The major factor in deciding the best model is to see how closely it can predict the values. Models are also tuned here to increase their accuracy and effectiveness, and various plots and graphs are used to inspect the models' predictions.

Real world example - using the Air Quality Dataset

The air quality dataset comes bundled with R. It contains the New York air quality measurements of 1973, recorded daily for five months from May to September. To view all the available datasets use the data() function; it will display all the datasets available with the R installation.
How to do it

Perform the following steps to see all the datasets in R and to explore airquality:

> data()
> str(airquality)
Output
'data.frame': 153 obs. of 6 variables:
$ Ozone : int 41 36 12 18 NA 28 23 19 8 NA ...
$ Solar.R: int 190 118 149 313 NA NA 299 99 19 194 ...
$ Wind : num 7.4 8 12.6 11.5 14.3 14.9 8.6 13.8 20.1 8.6 ...
$ Temp : int 67 72 74 62 56 66 65 59 61 69 ...
$ Month : int 5 5 5 5 5 5 5 5 5 5 ...
$ Day : int 1 2 3 4 5 6 7 8 9 10 ...
> head(airquality)
Output
  Ozone Solar.R Wind Temp Month Day
1    41     190  7.4   67     5   1
2    36     118  8.0   72     5   2
3    12     149 12.6   74     5   3
4    18     313 11.5   62     5   4
5    NA      NA 14.3   56     5   5
6    28      NA 14.9   66     5   6

How it works

The str command is used to display the structure of the dataset; as you can see, it contains observations of the Ozone, Solar.R, Wind, and Temp attributes recorded each day for five months. Using the head function, you can see the first few lines of actual data. The dataset is very basic, but it is enough to start processing and analyzing data at an introductory level. Another good source of data is the Kaggle website, which hosts many diverse kinds of datasets. Apart from datasets, it also holds many data science competitions to solve real-world problems. You can find the competitions, datasets, kernels, and jobs at https://www.kaggle.com/. Many competitions are organized by large corporate bodies, government agencies, or academia, and many of them have prize money associated with them. You can simply create an account and start participating in competitions by submitting code and output, which will then be assessed; the assessment or evaluation criteria are available on the detail page of each competition. By participating on Kaggle, one gains experience in solving real-world problems and gets a taste of what data scientists do. On the jobs page, various jobs for data scientists and analysts are listed, and you can apply if a profile matches your interests. If you liked our post, be sure to check out Machine Learning with R, which consists of more useful machine learning techniques with R.
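As promised above, here is a short, hedged Python sketch of the munging and basic-EDA steps; the file name air_quality.csv and the exact column names are stand-ins for whatever raw data you have collected, not something shipped with the book:

import pandas as pd

raw = pd.read_csv('air_quality.csv')      # data collection: load the raw file
print(raw.head())                         # a quick look at the first few rows
print(raw.isna().sum())                   # count the missing values per column
clean = raw.dropna(subset=['Ozone'])      # munging: drop rows missing the key measurement
clean = clean.sort_values('Temp')         # munging: sort for easier inspection
print(clean.describe())                   # basic exploratory statistics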

Building a Linear Regression Model in Python for developers

Pravin Dhandre
07 Feb 2018
7 min read
[box type="note" align="" class="" width=""]This article is an excerpt from a book by Rodolfo Bonnin titled Machine Learning for Developers. This book is a systematic guide for developers to implement various Machine Learning techniques and develop efficient and intelligent applications.[/box] Let’s start using one of the most well-known toy datasets, explore it, and select one of the dimensions to learn how to build a linear regression model for its values. Let's start by importing all the libraries (scikit-learn, seaborn, and matplotlib); one of the excellent features of Seaborn is its ability to define very professional-looking style settings. In this case, we will use the whitegrid style: import numpy as np from sklearn import datasets import seaborn.apionly as sns %matplotlib inline import matplotlib.pyplot as plt sns.set(style='whitegrid', context='notebook') The Iris Dataset It’s time to load the Iris dataset. This is one of the most well-known historical datasets. You will find it in many books and publications. Given the good properties of the data, it is useful for classification and regression examples. The Iris dataset (https://archive.ics.uci.edu/ml/datasets/Iris) contains 50 records for each of the three types of iris, 150 lines in a total over five fields. Each line is a measurement of the following: Sepal length in cm Sepal width in cm Petal length in cm Petal width in cm The final field is the type of flower (setosa, versicolor, or virginica). Let’s use the load_dataset method to create a matrix of values from the dataset: iris2 = sns.load_dataset('iris') In order to understand the dependencies between variables, we will implement the covariance operation. It will receive two arrays as parameters and will return the covariance(x,y) value: def covariance (X, Y): xhat=np.mean(X) yhat=np.mean(Y) epsilon=0 for x,y in zip (X,Y): epsilon=epsilon+(x-xhat)*(y-yhat) return epsilon/(len(X)-1) Let's try the implemented function and compare it with the NumPy function. Note that we calculated cov (a,b), and NumPy generated a matrix of all the combinations cov(a,a), cov(a,b), so our result should be equal to the values (1,0) and (0,1) of that matrix: print (covariance ([1,3,4], [1,0,2])) print (np.cov([1,3,4], [1,0,2])) 0.5 [[ 2.33333333   0.5              ] [ 0.5                   1.                ]] Having done a minimal amount of testing of the correlation function as defined earlier, receive two arrays, such as covariance, and use them to get the final value: def correlation (X, Y): return (covariance(X,Y)/(np.std(X,    ddof=1)*np.std(Y,   ddof=1))) ##We have to indicate ddof=1 the unbiased std Let’s test this function with two sample arrays, and compare this with the (0,1) and (1,0) values of the correlation matrix from NumPy: print (correlation ([1,1,4,3], [1,0,2,2])) print (np.corrcoef ([1,1,4,3], [1,0,2,2])) 0.870388279778 [[ 1.                     0.87038828] [ 0.87038828   1.                ]] Getting an intuitive idea with Seaborn pairplot A very good idea when starting worke on a problem is to get a graphical representation of all the possible variable combinations. Seaborn’s pairplot function provides a complete graphical summary of all the variable pairs, represented as scatterplots, and a representation of the univariate distribution for the matrix diagonal. 
Let’s look at how this plot type shows all the variables dependencies, and try to look for a linear relationship as a base to test our regression methods: sns.pairplot(iris2, size=3.0) <seaborn.axisgrid.PairGrid at 0x7f8a2a30e828> Pairplot of all the variables in the dataset. Lets' select two variables that, from our initial analysis, have the property of being linearly dependent. They are petal_width and petal_length: X=iris2['petal_width'] Y=iris2['petal_length'] Let’s now take a look at this variable combination, which shows a clear linear tendency: plt.scatter(X,Y) This is the representation of the chosen variables, in a scatter type graph: This is the current distribution of data that we will try to model with our linear prediction function. Creating the prediction function First, let's define the function that will abstractedly represent the modeled data, in the form of a linear function, with the form y=beta*x+alpha: def predict(alpha, beta, x_i): return beta * x_i + alpha Defining the error function It’s now time to define the function that will show us the difference between predictions and the expected output during training. We have two main alternatives: measuring the absolute difference between the values (or L1), or measuring a variant of the square of the difference (or L2). Let’s define both versions, including the first formulation inside the second: def error(alpha, beta, x_i, y_i): #L1 return y_i - predict(alpha, beta, x_i) def sum_sq_e(alpha, beta, x, y): #L2 return sum(error(alpha, beta, x_i, y_i) ** 2 for x_i, y_i in zip(x, y)) Correlation fit Now, we will define a function implementing the correlation method to find the parameters for our regression: def correlation_fit(x, y): beta = correlation(x, y) * np.std(y, ddof=1) / np.std(x,ddof=1) alpha = np.mean(y) - beta * np.mean(x) return alpha, beta Let’s then run the fitting function and print the guessed parameters: alpha, beta = correlation_fit(X, Y) print(alpha) print(beta) 1.08355803285 2.22994049512 Let’s now graph the regressed line with the data in order to intuitively show the appropriateness of the solution: plt.scatter(X,Y) xr=np.arange(0,3.5) plt.plot(xr,(xr*beta)+alpha) This is the final plot we will get with our recently calculated slope and intercept: Final regressed line Polynomial regression and an introduction to underfitting and overfitting When looking for a model, one of the main characteristics we look for is the power of generalizing with a simple functional expression. When we increase the complexity of the model, it's possible that we are building a model that is good for the training data, but will be too optimized for that particular subset of data. Underfitting, on the other hand, applies to situations where the model is too simple, such as this case, which can be represented fairly well with a simple linear model. In the following example, we will work on the same problem as before, using the scikit- learn library to search higher-order polynomials to fit the incoming data with increasingly complex degrees. 
Going beyond the normal threshold of a quadratic function, we will see how the function looks to fit every wrinkle in the data, but when we extrapolate, the values outside the normal range are clearly out of range: from sklearn.linear_model import Ridge from sklearn.preprocessing import PolynomialFeatures from sklearn.pipeline import make_pipeline ix=iris2['petal_width'] iy=iris2['petal_length'] # generate points used to represent the fitted function x_plot = np.linspace(0, 2.6, 100) # create matrix versions of these arrays X = ix[:, np.newaxis] X_plot = x_plot[:, np.newaxis] plt.scatter(ix, iy, s=30, marker='o', label="training points") for count, degree in enumerate([3, 6, 20]): model = make_pipeline(PolynomialFeatures(degree), Ridge()) model.fit(X, iy) y_plot = model.predict(X_plot) plt.plot(x_plot, y_plot, label="degree %d" % degree) plt.legend(loc='upper left') plt.show() The combined graph shows how the different polynomials' coefficients describe the data population in different ways. The 20 degree polynomial shows clearly how it adjusts perfectly for the trained dataset, and after the known values, it diverges almost spectacularly, going against the goal of generalizing for future data. Curve fitting of the initial dataset, with polynomials of increasing values With this, we successfully explored how to develop an efficient linear regression model in Python and how you can make predictions using the designed model. We've reviewed ways to identify and optimize the correlation between the prediction and the expected output using simple and definite functions. If you enjoyed our post, you must check out Machine Learning for Developers to uncover advanced tools for building machine learning applications on your fingertips.  
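As a quick sanity check on the correlation-based fit above (a sketch we are adding here, not part of the book's code), NumPy's polyfit solves the same least-squares problem directly, so the slope and intercept it returns should come out very close to beta and alpha:

import numpy as np

# ix and iy are the petal_width and petal_length columns defined in the snippet above
slope, intercept = np.polyfit(ix, iy, deg=1)
print(slope, intercept)   # expected to be close to the beta and alpha found earlier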

How to Create a Neural Network in TensorFlow

Aaron Lazar
06 Feb 2018
8 min read
[box type="note" align="" class="" width=""]This article has been extracted from the book Principles of Data Science authored by Sinan Ozdemir. With a unique approach that bridges the gap between mathematics and computer science, the book takes you through the entire data science pipeline, beginning with cleaning and preparing data, and moving on to effective data mining strategies and techniques to help you get to grips with machine learning.[/box] In this article, we're going to learn how to create a neural network whose goal will be to classify images. TensorFlow is an open-source machine learning module that is used primarily for its simplified deep learning and neural network abilities. I would like to take some time to introduce the module and solve a few quick problems using TensorFlow. Let's begin with some imports:

from sklearn import datasets, metrics
import tensorflow as tf
import numpy as np
from sklearn.cross_validation import train_test_split
%matplotlib inline

Loading our iris dataset:

# Our data set of iris flowers
iris = datasets.load_iris()
# Load datasets and split them for training and testing
X_train, X_test, y_train, y_test = train_test_split(iris.data, iris.target)

Creating the Neural Network:

# Specify that all features have real-value data
feature_columns = [tf.contrib.layers.real_valued_column("", dimension=4)]
optimizer = tf.train.GradientDescentOptimizer(learning_rate=.1)
# Build 3 layer DNN with 10, 20, 10 units respectively.
classifier = tf.contrib.learn.DNNClassifier(feature_columns=feature_columns,
    hidden_units=[10, 20, 10], optimizer=optimizer, n_classes=3)
# Fit model.
classifier.fit(x=X_train, y=y_train, steps=2000)

Notice that our code really hasn't changed from the last segment. We still have our feature_columns from before, but now we introduce, instead of a linear classifier, a DNNClassifier, which stands for Deep Neural Network Classifier. This is TensorFlow's syntax for implementing a neural network. Let's take a closer look:

tf.contrib.learn.DNNClassifier(feature_columns=feature_columns, hidden_units=[10, 20, 10], optimizer=optimizer, n_classes=3)

We see that we are inputting the same feature_columns, n_classes, and optimizer, but see how we have a new parameter called hidden_units? This list represents the number of nodes to have in each layer between the input and the output layer. All in all, this neural network will have five layers:

The first layer will have four nodes, one for each of the iris feature variables. This layer is the input layer.
A hidden layer of 10 nodes.
A hidden layer of 20 nodes.
A hidden layer of 10 nodes.
The final layer will have three nodes, one for each possible outcome of the network. This is called our output layer.

Now that we've trained our model, let's evaluate it on our test set:

# Evaluate accuracy.
accuracy_score = classifier.evaluate(x=X_test, y=y_test)["accuracy"]
print('Accuracy: {0:f}'.format(accuracy_score))
Accuracy: 0.921053

Hmm, our neural network didn't do so well on this dataset, but perhaps that is because the network is a bit too complicated for such a simple dataset. Let's introduce a new dataset that has a bit more to it. The MNIST dataset consists of over 50,000 handwritten digits (0-9), and the goal is to recognize each handwritten digit and output which digit it is. TensorFlow has a built-in mechanism for downloading and loading these images.
from tensorflow.examples.tutorials.mnist import input_data mnist = input_data.read_data_sets("MNIST_data/", one_hot=False) Extracting MNIST_data/train-images-idx3-ubyte.gz Extracting MNIST_data/train-labels-idx1-ubyte.gz Extracting MNIST_data/t10k-images-idx3-ubyte.gz Extracting MNIST_data/t10k-labels-idx1-ubyte.gz Notice that one of our inputs for downloading mnist is called one_hot. This parameter either brings in the dataset's target variable (which is the digit itself) as a single number or has a dummy variable. For example, if the first digit were a 7, the target would either be: 7: If one_hot was false 0 0 0 0 0 0 0 1 0 0: If one_hot was true (notice that starting from 0, the seventh index is a 1) We will encode our target the former way, as this is what our tensorflow neural network and our sklearn logistic regression will expect. The dataset is split up already into a training and test set, so let's create new variables to hold them: x_mnist = mnist.train.images y_mnist = mnist.train.labels.astype(int) For the y_mnist variable, I specifically cast every target as an integer (by default they come in as floats) because otherwise tensorflow would throw an error at us. Out of curiosity, let's take a look at a single image: import matplotlib.pyplot as plt plt.imshow(x_mnist[10].reshape(28, 28)) And hopefully our target variable matches at the 10th index as well: y_mnist[10] 0 Excellent! Let's now take a peek at how big our dataset is: x_mnist.shape (55000, 784) y_mnist.shape (55000,) Our training size then is 55000 images and target variables. Let's fit a deep neural network to our images and see if it will be able to pick up on the patterns in our inputs: # Specify that all features have real-value data feature_columns = [tf.contrib.layers.real_valued_column("", dimension=784)] optimizer = tf.train.GradientDescentOptimizer(learning_rate=.1) # Build 3 layer DNN with 10, 20, 10 units respectively. classifier = tf.contrib.learn.DNNClassifier(feature_columns=feature_columns,     hidden_units=[10, 20, 10],   optimizer=optimizer, n_classes=10) # Fit model. classifier.fit(x=x_mnist,       y=y_mnist,       steps=1000) # Warning this is veryyyyyyyy slow This code is very similar to our previous segment using DNNClassifier; however, look how in our first line of code, I have changed the number of columns to be 784 while in the classifier itself, I changed the number of output classes to be 10. These are manual inputs that tensorflow must be given to work. The preceding code runs very slowly. It is little by little adjusting itself in order to get the best possible performance from our training set. Of course, we know that the ultimate test here is testing our network on an unknown test set, which is also given to us from tensorflow: x_mnist_test = mnist.test.images y_mnist_test = mnist.test.labels.astype(int) x_mnist_test.shape (10000, 784) y_mnist_test.shape (10000,) So we have 10,000 images to test on; let's see how our network was able to adapt to the dataset: # Evaluate accuracy. accuracy_score = classifier.evaluate(x=x_mnist_test, y=y_mnist_test)["accuracy"] print('Accuracy: {0:f}'.format(accuracy_score)) Accuracy: 0.920600 Not bad, 92% accuracy on our dataset. Let's take a second and compare this performance to a standard sklearn logistic regression now: logreg = LogisticRegression() logreg.fit(x_mnist, y_mnist) # Warning this is slow y_predicted = logreg.predict(x_mnist_test) from sklearn.metrics import accuracy_score # predict on our test set, to avoid overfitting! 
accuracy = accuracy_score(y_predicted, y_mnist_test) # get our accuracy score Accuracy 0.91969 Success! Our neural network performed better than the standard logistic regression. This is likely because the network is attempting to find relationships between the pixels themselves and using these relationships to map them to what digit we are writing down. In logistic regression, the model assumes that every single input is independent of one another, and therefore has a tough time finding relationships between them. There are ways of making our neural network learn differently: We could make our network wider, that is, increase the number of nodes in the hidden layers instead of having several layers of a smaller number of nodes: # A wider network feature_columns = [tf.contrib.layers.real_valued_column("", dimension=784)] optimizer = tf.train.GradientDescentOptimizer(learning_rate=.1) # Build 3 layer DNN with 10, 20, 10 units respectively. classifier = tf.contrib.learn.DNNClassifier(feature_ columns=feature_columns,      hidden_units=[1500],       optimizer=optimizer,    n_classes=10) # Fit model. classifier.fit(x=x_mnist,       y=y_mnist,       steps=100) # Warning this is veryyyyyyyy slow # Evaluate accuracy. accuracy_score = classifier.evaluate(x=x_mnist_test,    y=y_mnist_test)["accuracy"] print('Accuracy: {0:f}'.format(accuracy_score)) Accuracy: 0.898400 We could increase our learning rate, forcing the network to attempt to converge into an answer faster. As mentioned before, we run the risk of the model skipping the answer entirely if we go down this route. It is usually better to stick with a smaller learning rate. We can change the method of optimization. Gradient descent is very popular; however, there are other algorithms for doing so. One example is called the Adam Optimizer. The difference is in the way they traverse the error function, and therefore the way that they approach the optimization point. Different problems in different domains call for different optimizers. There is no replacement for a good old fashioned feature selection phase instead of attempting to let the network figure everything out for us. We can take the time to find relevant and meaningful features that actually will allow our network to find an answer quicker! There you go! You’ve now learned how to build a neural net in Tensorflow! If you liked this tutorial and would like to learn more, head over and grab the copy Principles of Data Science. If you want to take things a bit further and learn how to classify Irises using multi-layer perceptrons, head over here.    
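The discussion above mentions the Adam optimizer as an alternative to gradient descent. As a hedged illustration, using the tf.keras API that has since replaced tf.contrib.learn in newer TensorFlow releases (so this is not the book's original code), the same shape of network could be trained with Adam like this:

import tensorflow as tf

# The same 10-20-10 hidden-layer shape as above, with one softmax output node per digit
model = tf.keras.Sequential([
    tf.keras.layers.Dense(10, activation='relu', input_shape=(784,)),
    tf.keras.layers.Dense(20, activation='relu'),
    tf.keras.layers.Dense(10, activation='relu'),
    tf.keras.layers.Dense(10, activation='softmax')
])
model.compile(optimizer='adam',                       # Adam instead of plain gradient descent
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])
model.fit(x_mnist, y_mnist, epochs=5)                 # x_mnist and y_mnist as loaded earlier
print(model.evaluate(x_mnist_test, y_mnist_test))     # loss and accuracy on the held-out images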

Implementing fault-tolerance in Spark Streaming data processing applications with Apache Kafka

Pravin Dhandre
01 Feb 2018
16 min read
[box type="note" align="" class="" width=""]This article is an excerpt from a book written by Rajanarayanan Thottuvaikkatumana titled Apache Spark 2 for Beginners. This book is a developer’s guide for developing large-scale and distributed data processing applications in their business environment. [/box] Data processing is generally carried in two ways, either in batch or stream processing. This article will help you learn how to start processing your data uninterruptedly and build fault-tolerance as and when the data gets generated in real-time Message queueing systems with publish-subscribe capability are generally used for processing messages. The traditional message queueing systems failed to perform because of the huge volume of messages to be processed per second for the needs of large-scale data processing applications. Kafka is a publish-subscribe messaging system used by many IoT applications to process a huge number of messages. The following capabilities of Kafka made it one of the most widely used messaging systems: Extremely fast: Kafka can process huge amounts of data by handling reading and writing in short intervals of time from many application clients Highly scalable: Kafka is designed to scale up and scale out to form a cluster using commodity hardware Persists a huge number of messages: Messages reaching Kafka topics are persisted into the secondary storage, while at the same time it is handling huge number of messages flowing through The following are some of the important elements of Kafka, and are terms to be understood before proceeding further: Producer: The real source of the messages, such as weather sensors or mobile phone network Broker: The Kafka cluster, which receives and persists the messages published to its topics by various producers Consumer: The data processing applications subscribed to the Kafka topics that consume the messages published to the topics The same log event processing application use case discussed in the preceding section is used again here to elucidate the usage of Kafka with Spark Streaming. Instead of collecting the log event messages from the TCP socket, here the Spark Streaming data processing application will act as a consumer of a Kafka topic and the messages published to the topic will be consumed. The Spark Streaming data processing application uses the version 0.8.2.2 of Kafka as the message broker, and the assumption is that the reader has already installed Kafka, at least in a standalone mode. The following activities are to be performed to make sure that Kafka is ready to process the messages produced by the producers and that the Spark Streaming data processing application can consume those messages: Start the Zookeeper that comes with Kafka installation. Start the Kafka server. Create a topic for the producers to send the messages to. Pick up one Kafka producer and start publishing log event messages to the newly created topic. Use the Spark Streaming data processing application to process the log eventspublished to the newly created topic. 
Starting Zookeeper and Kafka The following scripts are run from separate terminal windows in order to start Zookeeper and the Kafka broker, and to create the required Kafka topics: $ cd $KAFKA_HOME $ $KAFKA_HOME/bin/zookeeper-server-start.sh $KAFKA_HOME/config/zookeeper.properties [2016-07-24 09:01:30,196] INFO binding to port 0.0.0.0/0.0.0.0:2181 (org.apache.zookeeper.server.NIOServerCnxnFactory) $ $KAFKA_HOME/bin/kafka-server-start.sh $KAFKA_HOME/config/server.properties [2016-07-24 09:05:06,381] INFO 0 successfully elected as leader (kafka.server.ZookeeperLeaderElector) [2016-07-24 09:05:06,455] INFO [Kafka Server 0], started (kafka.server.KafkaServer) $ $KAFKA_HOME/bin/kafka-topics.sh --create --zookeeper localhost:2181 -- replication-factor 1 --partitions 1 --topic sfb Created topic "sfb". $ $KAFKA_HOME/bin/kafka-console-producer.sh --broker-list localhost:9092 -- topic sfb The Kafka message producer can be any application capable of publishing messages to the Kafka topics. Here, the kafka-console-producer that comes with Kafka is used as the producer of choice. Once the producer starts running, whatever is typed into its console window will be treated as a message that is published to the chosen Kafka topic. The Kafka topic is given as a command line argument when starting the kafka-console-producer. The submission of the Spark Streaming data processing application that consumes log event messages produced by the Kafka producer is slightly different from the application covered in the preceding section. Here, many Kafka jar files are required for the data processing. Since they are not part of the Spark infrastructure, they have to be submitted to the Spark cluster. The following jar files are required for the successful running of this application: $KAFKA_HOME/libs/kafka-clients-0.8.2.2.jar $KAFKA_HOME/libs/kafka_2.11-0.8.2.2.jar $KAFKA_HOME/libs/metrics-core-2.2.0.jar $KAFKA_HOME/libs/zkclient-0.3.jar Code/Scala/lib/spark-streaming-kafka-0-8_2.11-2.0.0-preview.jar Code/Python/lib/spark-streaming-kafka-0-8_2.11-2.0.0-preview.jar In the preceding list of jar files, the maven repository co-ordinate for spark-streamingkafka-0-8_2.11-2.0.0-preview.jar is "org.apache.spark" %% "sparkstreaming-kafka-0-8" % "2.0.0-preview". This particular jar file has to be downloaded and placed in the lib folder of the directory structure given in Figure 4. It is being used in the submit.sh and the submitPy.sh scripts, which submit the application to the Spark cluster. The download URL for this jar file is given in the reference section of this chapter. In the submit.sh and submitPy.sh files, the last few lines contain a conditional statement looking for the second parameter value of 1 to identify this application and ship the required jar files to the Spark cluster. Implementing the application in Scala The following code snippet is the Scala code for the log event processing application that processes the messages produced by the Kafka producer. The use case of this application is the same as the one discussed in the preceding section concerning windowing operations: /** The following program can be compiled and run using SBT Wrapper scripts have been provided with this The following script can be run to compile the code ./compile.sh The following script can be used to run this application in Spark. The  second command line argument of value 1 is very important. 
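As noted above, the Kafka message producer can be any application that publishes to the topic, not just the console producer. A minimal, hedged Python sketch using the third-party kafka-python package (assumed to be installed separately; it is not part of the Kafka or Spark distributions) could look like this:

from kafka import KafkaProducer

producer = KafkaProducer(bootstrap_servers='localhost:9092')
# Publish one sample log event to the sfb topic created earlier
log_line = b'[Fri Dec 20 01:46:23 2015] [ERROR] [client 1.2.3.4.5.6] Directory index forbidden'
producer.send('sfb', log_line)
producer.flush()   # make sure the message is actually delivered before exiting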
This is to flag the shipping of the kafka jar files to the Spark cluster ./submit.sh com.packtpub.sfb.KafkaStreamingApps 1 **/ package com.packtpub.sfb import java.util.HashMap import org.apache.spark.streaming._ import org.apache.spark.sql.{Row, SparkSession} import org.apache.spark.streaming.kafka._ import org.apache.kafka.clients.producer.{ProducerConfig, KafkaProducer, ProducerRecord} object KafkaStreamingApps { def main(args: Array[String]) { // Log level settings LogSettings.setLogLevels() // Variables used for creating the Kafka stream //The quorum of Zookeeper hosts val zooKeeperQuorum = "localhost" // Message group name val messageGroup = "sfb-consumer-group" //Kafka topics list separated by coma if there are multiple topics to be listened on val topics = "sfb" //Number of threads per topic val numThreads = 1 // Create the Spark Session and the spark context val spark = SparkSession .builder .appName(getClass.getSimpleName) .getOrCreate() // Get the Spark context from the Spark session for creating the streaming context val sc = spark.sparkContext // Create the streaming context val ssc = new StreamingContext(sc, Seconds(10)) // Set the check point directory for saving the data to recover when there is a crash ssc.checkpoint("/tmp") // Create the map of topic names val topicMap = topics.split(",").map((_, numThreads.toInt)).toMap // Create the Kafka stream val appLogLines = KafkaUtils.createStream(ssc, zooKeeperQuorum, messageGroup, topicMap).map(_._2) // Count each log messge line containing the word ERROR val errorLines = appLogLines.filter(line => line.contains("ERROR")) // Print the line containing the error errorLines.print() // Count the number of messages by the windows and print them errorLines.countByWindow(Seconds(30), Seconds(10)).print() // Start the streaming ssc.start() // Wait till the application is terminated ssc.awaitTermination() } } Compared to the Scala code in the preceding section, the major difference is in the way the stream is created. Implementing the application in Python The following code snippet is the Python code for the log event processing application that processes the message produced by the Kafka producer. 
The use case of this application is also the same as the one discussed in the preceding section concerning windowing operations: # The following script can be used to run this application in Spark # ./submitPy.sh KafkaStreamingApps.py 1 from __future__ import print_function import sys from pyspark import SparkContext from pyspark.streaming import StreamingContext from pyspark.streaming.kafka import KafkaUtils if __name__ == "__main__": # Create the Spark context sc = SparkContext(appName="PythonStreamingApp") # Necessary log4j logging level settings are done log4j = sc._jvm.org.apache.log4j log4j.LogManager.getRootLogger().setLevel(log4j.Level.WARN) # Create the Spark Streaming Context with 10 seconds batch interval ssc = StreamingContext(sc, 10) # Set the check point directory for saving the data to recover when there is a crash ssc.checkpoint("tmp") # The quorum of Zookeeper hosts zooKeeperQuorum="localhost" # Message group name messageGroup="sfb-consumer-group" # Kafka topics list separated by coma if there are multiple topics to be listened on topics = "sfb" # Number of threads per topic numThreads = 1 # Create a Kafka DStream kafkaStream = KafkaUtils.createStream(ssc, zooKeeperQuorum, messageGroup, {topics: numThreads}) # Create the Kafka stream appLogLines = kafkaStream.map(lambda x: x[1]) # Count each log messge line containing the word ERROR errorLines = appLogLines.filter(lambda appLogLine: "ERROR" in appLogLine) # Print the first ten elements of each RDD generated in this DStream to the console errorLines.pprint() errorLines.countByWindow(30,10).pprint() # Start the streaming ssc.start() # Wait till the application is terminated ssc.awaitTermination() The following commands are run on the terminal window to run the Scala application: $ cd Scala $ ./submit.sh com.packtpub.sfb.KafkaStreamingApps 1 The following commands are run on the terminal window to run the Python application: $ cd Python $ ./submitPy.sh KafkaStreamingApps.py 1 When both of the preceding programs are running, whatever log event messages are typed into the console window of the Kafka console producer, and invoked using the following command and inputs, will be processed by the application. The outputs of this program will be very similar to the ones that are given in the preceding section: $ $KAFKA_HOME/bin/kafka-console-producer.sh --broker-list localhost:9092 -- topic sfb [Fri Dec 20 01:46:23 2015] [ERROR] [client 1.2.3.4.5.6] Directory index forbidden by rule: /home/raj/ [Fri Dec 20 01:46:23 2015] [WARN] [client 1.2.3.4.5.6] Directory index forbidden by rule: /home/raj/ [Fri Dec 20 01:54:34 2015] [ERROR] [client 1.2.3.4.5.6] Directory index forbidden by rule: /apache/web/test Spark provides two approaches to process Kafka streams. The first one is the receiver-based approach that was discussed previously and the second one is the direct approach. This direct approach to processing Kafka messages is a simplified method in which Spark Streaming is using all the possible capabilities of Kafka just like any of the Kafka topic consumers, and polls for the messages in the specific topic, and the partition by the offset number of the messages. Depending on the batch interval of the Spark Streaming data processing application, it picks up a certain number of offsets from the Kafka cluster, and this range of offsets is processed as a batch. This is highly efficient and ideal for processing messages with a requirement to have exactly-once processing. 
This method also reduces the Spark Streaming library's need to do additional work to implement the exactly-once semantics of the message processing and delegates that responsibility to Kafka. The programming constructs of this approach are slightly different in the APIs used for the data processing. Consult the appropriate reference material for the details. The preceding sections introduced the concept of a Spark Streaming library and discussed some of the real-world use cases. There is a big difference between Spark data processing applications developed to process static batch data and those developed to process dynamic stream data in a deployment perspective. The availability of data processing applications to process a stream of data must be constant. In other words, such applications should not have components that are single points of failure. The following section is going to discuss this topic. Spark Streaming jobs in production When a Spark Streaming application is processing the incoming data, it is very important to have uninterrupted data processing capability so that all the data that is getting ingested is processed. In business-critical streaming applications, most of the time missing even one piece of data can have a huge business impact. To deal with such situations, it is important to avoid single points of failure in the application infrastructure. From a Spark Streaming application perspective, it is good to understand how the underlying components in the ecosystem are laid out so that the appropriate measures can be taken to avoid single points of failure. A Spark Streaming application deployed in a cluster such as Hadoop YARN, Mesos or Spark Standalone mode has two main components very similar to any other type of Spark application: Spark driver: This contains the application code written by the user Executors: The executors that execute the jobs submitted by the Spark driver But the executors have an additional component called a receiver that receives the data getting ingested as a stream and saves it as blocks of data in memory. When one receiver is receiving the data and forming the data blocks, they are replicated to another executor for fault-tolerance. In other words, in-memory replication of the data blocks is done onto a different executor. At the end of every batch interval, these data blocks are combined to form a DStream and sent out for further processing downstream. Figure 1 depicts the components working together in a Spark Streaming application infrastructure deployed in a cluster: In Figure 1, there are two executors. The receiver component is deliberately not displayed in the second executor to show that it is not using the receiver and instead just collects the replicated data blocks from the other executor. But when needed, such as on the failure of the first executor, the receiver in the second executor can start functioning. Implementing fault-tolerance in Spark Streaming data processing applications Spark Streaming data processing application infrastructure has many moving parts. Failures can happen to any one of them, resulting in the interruption of the data processing. Typically failures can happen to the Spark driver or the executors. When an executor fails, since the replication of data is happening on a regular basis, the task of receiving the data stream will be taken over by the executor on which the data was getting replicated. There is a situation in which when an executor fails, all the data that is unprocessed will be lost. 
To circumvent this problem, there is a way to persist the data blocks into HDFS or Amazon S3 in the form of write-ahead logs. When the Spark driver fails, the driven program is stopped, all the executors lose connection, and they stop functioning. This is the most dangerous situation. To deal with this situation, some configuration and code changes are necessary. The Spark driver has to be configured to have an automatic driver restart, which is supported by the cluster managers. This includes a change in the Spark job submission method to have the cluster mode in whichever may be the cluster manager. When a restart of the driver happens, to start from the place when it crashed, a checkpointing mechanism has to be implemented in the driver program. This has already been done in the code samples that are used. The following lines of code do that job: ssc = StreamingContext(sc, 10) ssc.checkpoint("tmp") From an application coding perspective, the way the StreamingContext is created is slightly different. Instead of creating a new StreamingContext every time, the factory method getOrCreate of the StreamingContext is to be used with a function, as shown in the following code segment. If that is done, when the driver is restarted, the factory method will check the checkpoint directory to see whether an earlier StreamingContext was in use, and, if found in the checkpoint data, it is created. Otherwise, a new StreamingContext is created. The following code snippet gives the definition of a function that can be used with the getOrCreate factory method of the StreamingContext. As mentioned earlier, a detailed treatment of these aspects is beyond the scope of this book: /** * The following function has to be used when the code is being restructured to have checkpointing and driver recovery * The way it should be used is to use the StreamingContext.getOrCreate with this function and do a start of that */ def sscCreateFn(): StreamingContext = { // Variables used for creating the Kafka stream // The quorum of Zookeeper hosts val zooKeeperQuorum = "localhost" // Message group name val messageGroup = "sfb-consumer-group" //Kafka topics list separated by coma if there are multiple topics to be listened on val topics = "sfb" //Number of threads per topic val numThreads = 1 // Create the Spark Session and the spark context val spark = SparkSession .builder .appName(getClass.getSimpleName) .getOrCreate() // Get the Spark context from the Spark session for creating the streaming context val sc = spark.sparkContext // Create the streaming context val ssc = new StreamingContext(sc, Seconds(10)) // Create the map of topic names val topicMap = topics.split(",").map((_, numThreads.toInt)).toMap // Create the Kafka stream val appLogLines = KafkaUtils.createStream(ssc, zooKeeperQuorum, messageGroup, topicMap).map(_._2) // Count each log messge line containing the word ERROR val errorLines = appLogLines.filter(line => line.contains("ERROR")) // Print the line containing the error errorLines.print() // Count the number of messages by the windows and print them errorLines.countByWindow(Seconds(30), Seconds(10)).print() // Set the check point directory for saving the data to recover when there is a crash ssc.checkpoint("/tmp") // Return the streaming context ssc } At a data source level, it is a good idea to build parallelism for faster data processing and, depending on the source of data, this can be accomplished in different ways. 
Kafka inherently supports partition at the topic level, and that kind of scaling out mechanism supports a good amount of parallelism. As a consumer of Kafka topics, the Spark Streaming data processing application can have multiple receivers by creating multiple streams, and the data generated by those streams can be combined by the union operation on the Kafka streams. The production deployment of Spark Streaming data processing applications is to be done purely based on the type of application that is being used. Some of the guidelines given previously are just introductory and conceptual in nature. There is no silver bullet approach to solving production deployment problems, and they have to evolve along with the application development. To summarize, we looked at the production deployment of Spark Streaming data processing applications and the possible ways of implementing fault-tolerance in Spark Streaming and data processing applications using Kafka. To explore more critical and equally important Spark tools such as Spark GraphX, Spark MLlib, DataFrames etc, do check out Apache Spark 2 for Beginners  to develop efficient large-scale applications with Apache Spark.  
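To make the driver-recovery discussion above concrete in Python as well, here is a hedged sketch of the same getOrCreate pattern, mirroring the Scala sscCreateFn function; the details are an illustration under the same assumptions as the earlier Python application, not production-ready code:

from pyspark import SparkContext
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils

def create_streaming_context():
    # Build a fresh context only when no checkpoint data exists
    sc = SparkContext(appName="PythonStreamingApp")
    ssc = StreamingContext(sc, 10)
    kafka_stream = KafkaUtils.createStream(ssc, "localhost", "sfb-consumer-group", {"sfb": 1})
    error_lines = kafka_stream.map(lambda x: x[1]).filter(lambda line: "ERROR" in line)
    error_lines.countByWindow(30, 10).pprint()
    ssc.checkpoint("/tmp")
    return ssc

# On a driver restart, the context is rebuilt from the checkpoint directory instead of from scratch
ssc = StreamingContext.getOrCreate("/tmp", create_streaming_context)
ssc.start()
ssc.awaitTermination()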

How to run Spark in Mesos

Sunith Shetty
31 Jan 2018
6 min read
This article is an excerpt from a book written by Muhammad Asif Abbasi titled Learning Apache Spark 2. In this book, you will learn how to perform big data analytics using Spark streaming, machine learning techniques and more. From the article given below, you will learn how to operate Spark in Mesos cluster manager. What is Mesos? Mesos is an open source cluster manager started as a UC Berkley research project in 2008 and quite widely used by a number of organizations. Spark supports Mesos, and Matei Zahria has given a keynote at Mesos Con in June of 2016. Here is a link to the YouTube video of the keynote. Before you start If you haven't installed Mesos previously, the getting started page on the Apache website gives a good walk through of installing Mesos on Windows, MacOS, and Linux. Follow the URL https://mesos.apache.org/getting-started/. Once installed you need to start-up Mesos on your cluster Starting Mesos Master: ./bin/mesos-master.sh -ip=[MasterIP] -workdir=/var/lib/mesos Start Mesos Agents on all your worker nodes: ./bin/mesos-agent.sh - master=[MasterIp]:5050 -work-dir=/var/lib/mesos Make sure Mesos is up and running with all your relevant worker nodes configured: http://[MasterIP]@5050 Make sure that Spark binary packages are available and accessible by Mesos. They can be placed on a Hadoop-accessible URI for example: HTTP via http:// S3 via s3n:// HDFS via hdfs:// You can also install spark in the same location on all the Mesos slaves, and configure spark.mesos.executor.home to point to that location. Running in Mesos Mesos can have single or multiple masters, which means the Master URL differs when submitting application from Spark via mesos: Single Master Mesos://sparkmaster:5050 Multiple Masters (Using Zookeeper) Mesos://zk://master1:2181, master2:2181/mesos Modes of operation in Mesos Mesos supports both the Client and Cluster modes of operation: Client mode Before running the client mode, you need to perform couple of configurations: Spark-env.sh Export MESOS_NATIVE_JAVA_LIBRARY=<Path to libmesos.so [Linux]> or <Path to libmesos.dylib[MacOS]> Export SPARK_EXECUTOR_URI=<URI of Spark zipped file uploaded to an accessible location e.g. HTTP, HDFS, S3> Set spark.executor.uri to URI of Spark zipped file uploaded to an accessible location e.g. HTTP, HDFS, S3 Batch Applications For batch applications, in your application program you need to pass on the Mesos URL as the master when creating your Spark context. As an example: val sparkConf = new SparkConf()                .setMaster("mesos://mesosmaster:5050")                .setAppName("Batch Application")                .set("spark.executor.uri", "Location to Spark binaries                (Http, S3, or HDFS)") val sc = new SparkContext(sparkConf) If you are using Spark-submit, you can configure the URI in the conf/sparkdefaults.conf file using spark.executor.uri. Interactive applications When you are running one of the provided spark shells for interactive querying, you can pass the master argument e.g: ./bin/spark-shell -master mesos://mesosmaster:5050 Cluster mode Just as in YARN, you run spark on mesos in a cluster mode, which means the driver is launched inside the cluster and the client can disconnect after submitting the application, and get results from the Mesos WebUI. Steps to use the cluster mode Start the MesosClusterDispatcher in your cluster: ./sbin/start-mesos-dispatcher.sh -master mesos://mesosmaster:5050. This will generally start the dispatcher at port 7077. 
From the client, submit a job to the Mesos cluster with spark-submit, specifying the dispatcher URL. For example:

./bin/spark-submit
  --class org.apache.spark.examples.SparkPi
  --master mesos://dispatcher:7077
  --deploy-mode cluster
  --supervise
  --executor-memory 2G
  --total-executor-cores 10
  s3n://path/to/examples.jar

As with Spark on other cluster managers, Spark on Mesos has lots of properties that can be set to optimize the processing. You should refer to the Spark Configuration page (http://spark.apache.org/docs/latest/configuration.html) for more information.

Mesos run modes

Spark can run on Mesos in two modes:

Coarse Grained (default mode): Spark will acquire a long-running Mesos task on each machine. This offers a much lower startup cost, but the resources will continue to be allocated to Spark for the complete duration of the application.
Fine Grained (deprecated): The fine-grained mode is deprecated; in this case a Mesos task is created per Spark task. The benefit of this is that each application receives cores as per its requirements, but the initial bootstrapping might act as a deterrent for interactive applications.

Key Spark on Mesos configuration properties

While Spark has a number of properties that can be configured to optimize Spark processing, some of these properties are specific to Mesos. We'll look at a few of those key properties here.

spark.mesos.coarse: Setting it to true (the default value) will run Mesos in coarse-grained mode; setting it to false will run it in fine-grained mode.
spark.mesos.extra.cores: This is more of an advertisement rather than an allocation, in order to improve parallelism. An executor will pretend that it has extra cores, resulting in the driver sending it more work. Default=0.
spark.mesos.mesosExecutor.cores: Only works in fine-grained mode. This specifies how many cores should be given to each Mesos executor.
spark.mesos.executor.home: Identifies the directory of the Spark installation for the executors in Mesos. As discussed, you can specify this using spark.executor.uri as well; however, if you have not specified that, you can specify it using this property.
spark.mesos.executor.memoryOverhead: The amount of memory (in MB) to be allocated per executor.
spark.mesos.uris: A comma-separated list of URIs to be downloaded when the driver or executor is launched by Mesos.
spark.mesos.principal: The name of the principal used by Spark to authenticate itself with Mesos.

You can find other configuration properties at the Spark documentation page (http://spark.apache.org/docs/latest/running-on-mesos.html#spark-properties). To summarize, we covered the objective of getting you started with running Spark on Mesos. To know more about Spark SQL, Spark Streaming, and Machine Learning with Spark, you can refer to the book Learning Apache Spark 2.
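As a final, hedged illustration of how a couple of the properties listed above could be set in application code (the master URL and the executor home path are placeholders for your own cluster, not values taken from the book):

from pyspark import SparkConf, SparkContext

conf = (SparkConf()
        .setAppName("MesosExample")
        .setMaster("mesos://mesosmaster:5050")
        .set("spark.mesos.coarse", "true")                # coarse-grained mode (the default)
        .set("spark.mesos.executor.home", "/opt/spark"))  # where Spark is installed on the agents
sc = SparkContext(conf=conf)
print(sc.parallelize(range(100)).sum())                   # a trivial job to confirm the setup works
sc.stop()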

Getting Started with Data Storytelling

Aaron Lazar
28 Jan 2018
11 min read
[box type="note" align="" class="" width=""]This article has been taken from the book Principles of Data Science, written by Sinan Ozdemir. It aims to practically introduce you to the different ways in which you can communicate or visualize your data to tell stories effectively.[/box] Communication matters Being able to conduct experiments and manipulate data in a coding language is not enough to conduct practical and applied data science. This is because data science is, generally, only as good as how it is used in practice. For instance, a medical data scientist might be able to predict the chance of a tourist contracting Malaria in developing countries with >98% accuracy, however, if these results are published in a poorly marketed journal and online mentions of the study are minimal, their groundbreaking results that could potentially prevent deaths would never see the true light of day. For this reason, communication of results through data storytelling is arguably as important as the results themselves. A famous example of poor management of distribution of results is the case of Gregor Mendel. Mendel is widely recognized as one of the founders of modern genetics. However, his results (including data and charts) were not well adopted until after his death. Mendel even sent them to Charles Darwin, who largely ignored Mendel's papers, which were written in unknown Moravian journals. Generally, there are two ways of presenting results: verbal and visual. Of course, both the verbal and visual forms of communication can be broken down into dozens of subcategories, including slide decks, charts, journal papers, and even university lectures. However, we can find common elements of data presentation that can make anyone in the field more aware and effective in their communication skills. Let's dive right into effective (and ineffective) forms of communication, starting with visuals. We’ll look at four basic types of graphs: scatter plots, line graphs, bar charts, histograms, and box plots. Scatter plots A scatter plot is probably one of the simplest graphs to create. It is made by creating two quantitative axes and using data points to represent observations. The main goal of a scatter plot is to highlight relationships between two variables and, if possible, reveal a correlation. For example, we can look at two variables: average hours of TV watched in a day and a 0-100 scale of work performance (0 being very poor performance and 100 being excellent performance). The goal here is to find a relationship (if it exists) between watching TV and average work performance. The following code simulates a survey of a few people, in which they revealed the amount of television they watched, on an average, in a day against a company-standard work performance metric: import pandas as pd hours_tv_watched = [0, 0, 0, 1, 1.3, 1.4, 2, 2.1, 2.6, 3.2, 4.1, 4.4, 4.4, 5] This line of code is creating 14 sample survey results of people answering the question of how many hours of TV they watch in a day. work_performance = [87, 89, 92, 90, 82, 80, 77, 80, 76, 85, 80, 75, 73, 72] This line of code is creating 14 new sample survey results of the same people being rated on their work performance on a scale from 0 to 100. 
For example, the first person watched 0 hours of TV a day and was rated 87/100 on their work, while the last person watched, on an average, 5 hours of TV a day and was rated 72/100: df = pd.DataFrame({'hours_tv_watched':hours_tv_watched, 'work_ performance':work_performance}) Here, we are creating a Dataframe in order to ease our exploratory data analysis and make it easier to make a scatter plot: df.plot(x='hours_tv_watched', y='work_performance', kind='scatter') Now, we are actually making our scatter plot. In the following plot, we can see that our axes represent the number of hours of TV watched in a day and the person's work performance metric: Each point on a scatter plot represents a single observation (in this case a person) and its location is a result of where the observation stands on each variable. This scatter plot does seem to show a relationship, which implies that as we watch more TV in the day, it seems to affect our work performance. Of course, as we are now experts in statistics from the last two chapters, we know that this might not be causational. A scatter plot may only work to reveal a correlation or an association between but not a causation. Advanced statistical tests, such as the ones we saw in Chapter 8, Advanced Statistics, might work to reveal causation. Later on in this chapter, we will see the damaging effects that trusting correlation might have. Line graphs Line graphs are, perhaps, one of the most widely used graphs in data communication. A line graph simply uses lines to connect data points and usually represents time on the x axis. Line graphs are a popular way to show changes in variables over time. The line graph, like the scatter plot, is used to plot quantitative variables. As a great example, many of us wonder about the possible links between what we see on TV and our behavior in the world. A friend of mine once took this thought to an extreme—he wondered if he could find a relationship between the TV show, The X-Files, and the amount of UFO sightings in the U.S.. He then found the number of sightings of UFOs per year and plotted them over time. He then added a quick graphic to ensure that readers would be able to identify the point in time when the X-files were released: It appears to be clear that right after 1993, the year of the X-Files premier, the number of UFO sightings started to climb drastically. This graphic, albeit light-hearted, is an excellent example of a simple line graph. We are told what each axis measures, we can quickly see a general trend in the data, and we can identify with the author's intent, which is to show a relationship between the number of UFO sightings and the X-files premier. On the other hand, the following is a less impressive line chart: This line graph attempts to highlight the change in the price of gas by plotting three points in time. At first glance, it is not much different than the previous graph—we have time on the bottom x axis and a quantitative value on the vertical y axis. The (not so) subtle difference here is that the three points are equally spaced out on the x axis; however, if we read their actual time indications, they are not equally spaced out in time. A year separates the first two points whereas a mere 7 days separates the last two points. Bar charts We generally turn to bar charts when trying to compare variables across different groups. For example, we can plot the number of countries per continent using a bar chart. 
Bar charts

We generally turn to bar charts when trying to compare variables across different groups. For example, we can plot the number of countries per continent using a bar chart. Note how the x axis does not represent a quantitative variable; in fact, when using a bar chart, the x axis is generally a categorical variable, while the y axis is quantitative.

Note that, for this code, I am using the World Health Organization's report on alcohol consumption around the world by country:

import matplotlib.pyplot as plt

drinks = pd.read_csv('data/drinks.csv')
drinks.continent.value_counts().plot(kind='bar', title='Countries per Continent')
plt.xlabel('Continent')
plt.ylabel('Count')

The following graph shows us a count of the number of countries in each continent. We can see the continent code at the bottom of the bars, and the bar height represents the number of countries we have in each continent. For example, we see that Africa has the most countries represented in our survey, while South America has the least.

In addition to the count of countries, we can also plot the average beer servings per continent using a bar chart, as shown:

drinks.groupby('continent').beer_servings.mean().plot(kind='bar')

Note how a scatter plot or a line graph would not be able to support this data, because they can only handle quantitative variables; bar graphs have the ability to demonstrate categorical values. We can also use bar charts to graph variables that change over time, like a line graph.

Histograms

Histograms show the frequency distribution of a single quantitative variable by splitting up the data, by range, into equidistant bins and plotting the raw count of observations in each bin. A histogram is effectively a bar chart where the x axis is a bin (subrange) of values and the y axis is a count. As an example, I will import a store's daily number of unique customers, as shown:

rossmann_sales = pd.read_csv('data/rossmann.csv')
rossmann_sales.head()

Note how we have data for multiple stores (see the first Store column). Let's subset this data for only the first store, as shown:

first_rossmann_sales = rossmann_sales[rossmann_sales['Store']==1]

Now, let's plot a histogram of the first store's customer count:

first_rossmann_sales['Customers'].hist(bins=20)
plt.xlabel('Customer Bins')
plt.ylabel('Count')

The x axis is now categorical in the sense that each category is a selected range of values; for example, 600-620 customers would potentially be a category. The y axis, like a bar chart, plots the number of observations in each category. In this graph, for example, one might take away the fact that most of the time, the number of customers on any given day will fall between 500 and 700. Altogether, histograms are used to visualize the distribution of values that a quantitative variable can take on.

Box plots

Box plots are also used to show a distribution of values. They are created by plotting the five-number summary, as follows:

The minimum value
The first quartile (the number that separates the lowest 25% of values from the rest)
The median
The third quartile (the number that separates the highest 25% of values from the rest)
The maximum value

In Pandas, when we create box plots, the red line denotes the median, the top of the box (or the right edge if it is horizontal) is the third quartile, and the bottom (or left edge) of the box is the first quartile. The following is a series of box plots showing the distribution of beer consumption by continent:

drinks.boxplot(column='beer_servings', by='continent')

Now, we can clearly see the distribution of beer consumption across the continents and how they differ. Africa and Asia have a much lower median of beer consumption than Europe or North America.
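Before comparing box plots with histograms, it can help to see exactly where the five numbers behind a box plot come from. The following is a minimal sketch, assuming the same drinks dataset loaded above; the five_number_summary dictionary is just an illustrative name, and the same idea works for any numeric pandas Series.

import numpy as np
import pandas as pd

drinks = pd.read_csv('data/drinks.csv')
servings = drinks['beer_servings']

# The five numbers a box plot is drawn from
five_number_summary = {
    'min': servings.min(),
    'Q1': np.percentile(servings, 25),   # first quartile
    'median': servings.median(),
    'Q3': np.percentile(servings, 75),   # third quartile
    'max': servings.max(),
}
print(five_number_summary)

The box spans Q1 to Q3, the line inside the box sits at the median, and the whiskers reach toward the minimum and maximum (with very distant points typically drawn as individual outliers).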
Box plots also have the added bonus of being able to show outliers much better than a histogram, because the minimum and maximum are part of the box plot. Getting back to the customer data, let's look at the same store's customer numbers, but using a box plot:

first_rossmann_sales.boxplot(column='Customers', vert=False)

This is the exact same data as plotted earlier in the histogram; however, now it is shown as a box plot. For the purpose of comparison, I will show you both graphs one after the other. Note how the x axis for each graph is the same, ranging from 0 to 1,200. The box plot is much quicker at giving us the center of the data (the red line is the median), while the histogram works much better at showing us how spread out the data is and where the largest bins are. For example, the histogram reveals that there is a very large bin at zero customers. This means that for a little over 150 days of data, there were zero customers.

Note that we can get the exact numbers needed to construct a box plot using the describe method in Pandas, as shown:

first_rossmann_sales['Customers'].describe()

min       0.000000
25%     463.000000
50%     529.000000
75%     598.750000
max    1130.000000

There we have it! We just learned data storytelling through various techniques like scatter plots, line graphs, bar charts, histograms, and box plots. Now you've got the power to be creative in the way you tell tales of your data!

If you found our article useful, you can check out Principles of Data Science for more interesting Data Science tips and techniques.

article-image-how-to-build-a-gaussian-mixture-model
Gebin George
27 Jan 2018
8 min read
Save for later

How to build a Gaussian Mixture Model

Gebin George
27 Jan 2018
8 min read
[box type="note" align="" class="" width=""]This article is an excerpt from a book authored by Osvaldo Martin titled Bayesian Analysis with Python. This book will help you implement Bayesian analysis in your application and will guide you to build complex statistical problems using Python.[/box] Our article teaches you to build an end to end gaussian mixture model with a practical example. The general idea when building a finite mixture model is that we have a certain number of subpopulations, each one represented by some distribution, and we have data points that belong to those distribution but we do not know to which distribution each point belongs. Thus we need to assign the points properly. We can do that by building a hierarchical model. At the top level of the model, we have a random variable, often referred as a latent variable, which is a variable that is not really observable. The function of this latent variable is to specify to which component distribution a particular observation is assigned to. That is, the latent variable decides which component distribution we are going to use to model a given data point. In the literature, people often use the letter z to indicate latent variables. Let us start building mixture models with a very simple example. We have a dataset that we want to describe as being composed of three Gaussians. clusters = 3 n_cluster = [90, 50, 75] n_total = sum(n_cluster) means = [9, 21, 35] std_devs = [2, 2, 2] mix = np.random.normal(np.repeat(means, n_cluster),  np.repeat(std_devs, n_cluster)) sns.kdeplot(np.array(mix)) plt.xlabel('$x$', fontsize=14) In many real situations, when we wish to build models, it is often more easy, effective and productive to begin with simpler models and then add complexity, even if we know from the beginning that we need something more complex. This approach has several advantages, such as getting familiar with the data and problem, developing intuition, and avoiding choking us with complex models/codes that are difficult to debug. So, we are going to begin by supposing that we know that our data can be described using three Gaussians (or in general, k-Gaussians), maybe because we have enough previous experimental or theoretical knowledge to reasonably assume this, or maybe we come to that conclusion by eyeballing the data. We are also going to assume we know the mean and standard deviation of each Gaussian. Given this assumptions the problem is reduced to assigning each point to one of the three possible known Gaussians. There are many methods to solve this task. We of course are going to take the Bayesian track and we are going to build a probabilistic model. To develop our model, we can get ideas from the coin-flipping problem. Remember that we have had two possible outcomes and we used the Bernoulli distribution to describe them. Since we did not know the probability of getting heads or tails, we use a beta prior distribution. Our current problem with the Gaussians mixtures is similar, except that we now have k-Gaussian outcomes. The generalization of the Bernoulli distribution to k-outcomes is the categorical distribution and the generalization of the beta distribution is the Dirichlet distribution. This distribution may look a little bit weird at first because it lives in the simplex, which is like an n-dimensional triangle; a 1-simplex is a line, a 2-simplex is a triangle, a 3-simplex a tetrahedron, and so on. Why a simplex? 
Intuitively, because the output of this distribution is a k-length vector whose elements are restricted to be positive and sum up to one. To understand how the Dirichlet generalizes the beta, let us first refresh a couple of features of the beta distribution. We use the beta for 2-outcome problems, one outcome with probability p and the other with probability 1-p. In this sense, we can think that the beta returns a two-element vector, [p, 1-p]. Of course, in practice, we omit 1-p because it is fully determined by p. Another feature of the beta distribution is that it is parameterized using two scalars, α and β.

How do these features compare to the Dirichlet distribution? Let us think of the simplest Dirichlet distribution, one we could use to model a three-outcome problem. We get a Dirichlet distribution that returns a three-element vector [p, q, r], where r = 1 - (p + q). We could use three scalars to parameterize such a Dirichlet, and we may call them α, β, and γ; however, this does not scale well to higher dimensions, so we just use a vector named α with length k, where k is the number of outcomes. Note that we can think of the beta and Dirichlet as distributions over probabilities. To get an idea about this distribution, pay attention to the following figure and try to relate each triangular subplot to a beta distribution with similar parameters.

The preceding figure is the output of code written by Thomas Boggs with just a few minor tweaks. You can find the code in the accompanying text; also check the Keep reading sections for details.

Now that we have a better grasp of the Dirichlet distribution, we have all the elements to build our mixture model. One way to visualize it is as a k-sided coin-flip model on top of a Gaussian estimation model. Of course, instead of k-sided coins, we have k Gaussians: the rounded-corner box in the model diagram indicates that we have k Gaussian likelihoods (with their corresponding priors), and the categorical variables decide which of them we use to describe a given data point. Remember, we are assuming we know the means and standard deviations of the Gaussians; we just need to assign each data point to one Gaussian. One detail of the following model is that we have used two samplers, Metropolis and ElemwiseCategorical, the latter of which is specially designed to sample discrete variables:

import pymc3 as pm

with pm.Model() as model_kg:
    p = pm.Dirichlet('p', a=np.ones(clusters))
    category = pm.Categorical('category', p=p, shape=n_total)
    means = pm.math.constant([10, 20, 35])
    y = pm.Normal('y', mu=means[category], sd=2, observed=mix)
    step1 = pm.ElemwiseCategorical(vars=[category], values=range(clusters))
    step2 = pm.Metropolis(vars=[p])
    trace_kg = pm.sample(10000, step=[step1, step2])

chain_kg = trace_kg[1000:]
varnames_kg = ['p']
pm.traceplot(chain_kg, varnames_kg)

Now that we know the skeleton of a Gaussian mixture model, we are going to add a complexity layer and estimate the parameters of the Gaussians. We are going to assume three different means and a single shared standard deviation. As usual, the model translates easily to the PyMC3 syntax.
with pm.Model() as model_ug:
    p = pm.Dirichlet('p', a=np.ones(clusters))
    category = pm.Categorical('category', p=p, shape=n_total)
    means = pm.Normal('means', mu=[10, 20, 35], sd=2, shape=clusters)
    sd = pm.HalfCauchy('sd', 5)
    y = pm.Normal('y', mu=means[category], sd=sd, observed=mix)
    step1 = pm.ElemwiseCategorical(vars=[category], values=range(clusters))
    step2 = pm.Metropolis(vars=[means, sd, p])
    trace_ug = pm.sample(10000, step=[step1, step2])

Now we explore the trace we got:

chain = trace_ug[1000:]
varnames = ['means', 'sd', 'p']
pm.traceplot(chain, varnames)

And a tabulated summary of the inference:

pm.df_summary(chain, varnames)

              mean        sd    mc_error    hpd_2.5   hpd_97.5
means__0  21.053935  0.310447  0.012280   20.495889  21.735211
means__1  35.291631  0.246817  0.008159   34.831048  35.781825
means__2   8.956950  0.235121  0.005993    8.516094   9.429345
sd         2.156459  0.107277  0.002710    1.948067   2.368482
p__0       0.235553  0.030201  0.000793    0.179247   0.297747
p__1       0.349896  0.033905  0.000957    0.281977   0.412592
p__2       0.347436  0.032414  0.000942    0.286669   0.410189

Now we are going to do a posterior predictive check to see what our model learned from the data:

ppc = pm.sample_ppc(chain, 50, model_ug)
for i in ppc['y']:
    sns.kdeplot(i, alpha=0.1, color='b')

sns.kdeplot(np.array(mix), lw=2, color='k')
plt.xlabel('$x$', fontsize=14)

Notice how the uncertainty, represented by the lighter blue lines, is smaller for the smaller and larger values of x and is higher around the central Gaussian. This makes intuitive sense, since the regions of higher uncertainty correspond to the regions where the Gaussians overlap, and hence it is harder to tell whether a point belongs to one or the other Gaussian. I agree that this is a very simple problem and not that much of a challenge, but it is a problem that contributes to our intuition, and a model that can easily be applied or extended to more complex problems.

We saw how to build a Gaussian mixture model using a very basic model as an example, which can be applied to solve more complex problems. If you enjoyed this excerpt, check out the book Bayesian Analysis with Python to understand the Bayesian framework and solve complex statistical problems using Python.