Data | Tech News, Tutorials & Expert Insights

article-image-highlights-from-mary-meekers-2019-internet-trends-report

12 Jun 2019

8 min read

Highlights from Mary Meeker’s 2019 Internet trends report

12 Jun 2019

At Recode by Vox’s 2019 Code Conference on Tuesday, Bond partner Mary Meeker made her presentation onstage, covering everything on the internet's latest trends. Meeker had first started presenting these reports in 1995, underlining the most important statistics and technology trends on the internet. Last year in September, Meeker quit Kleiner Perkins to start her own firm Bond and is popularly known as the Queen of the Internet. Mary Meeker’s 2019 Internet trends report highlighted that the internet is continuing to grow, slowly, as more users come online, especially with mobile devices. She also talked about increased internet ad spending, data growth, as well as the rise of freemium subscription business models, interactive gaming, the on-demand economy and more. https://youtu.be/G_dwZB5h56E The internet trends highlighted by Meeker include: Internet Users E-commerce and advertising Internet Usage Freemium business models Data growth Jobs and Work Online Education Immigration and Healthcare Internet Users More than 50% of the world’s population now has access to the internet. There are 3.8 billion internet users in the world with Asia-pacific leading in both users and potential. China is the largest market with 21% of total internet users and India is at 12%. However, the growth is slowing by 6% in 2018 versus 7% in 2017 because so many people have come online that new users are harder to come by. New smartphone unit shipments actually declined in 2018. Per the global internet market cap leaders, the U.S. is stable at 18 of the top 30 and China is stable at 7 of the top 30. These are the two leading countries where internet innovation is at an especially high level. If we look at revenue growth for the internet market cap leaders it continues to slow - 11 percent year-on-year in Q1 versus 13 percent in Q4. Internet usage Internet usage had a solid growth, driven by investment in innovation. The digital media usage in the U.S. is accelerating up 7% versus 5% growth in 2017. The average US adult spends 6.3 hours each day with digital media, over half of which is spent on their mobiles. Wearables had 52 million users which doubled in four years. Roughly 70 million people globally listen to podcasts in the US, a figure that’s doubled in about four years. Outside the US, there's especially high innovation in data-driven and direct fulfillment that's growing very rapidly in China. Innovation outside the US is also especially strong in financial services. Images are also becoming an increasingly relevant way to communicate. More than 50% of the tweets of impressions today are images, video or other forms of media. Interactive gaming innovation is rising across platforms as interactive games like Fortnite become the new social media for certain people. It is accelerating with 2.4 billion users up, 6 percent year-on-year in 2018. On the flip side Almost 26% of adults are constantly online versus 21% three years ago. That number jumped to 39% for 18 to 29 year-olds surveyed. However, digital media users are taking action to reduce their usage and businesses are also taking actions to help users monitor their usage. Social media usage has decelerated up 1% in 2018 versus 6% in 2017. Privacy concerns are high but they're moderating. Regulators and businesses are improving consumer privacy control. In digital media encrypted messaging and traffic are rising rapidly. In Q1, 87 percent of global web traffic was encrypted, up from 53 percent three years ago. Another usage concern is problematic content. Problematic content on the Internet can be less filtered and more amplified. Images and streaming can be more powerful than text. Algorithms can amplify users on patterns and social media can amplify trending topics. Bad actors can amplify ideologies, unintended bad actors can amplify misinformation and extreme views can amplify polarization. However internet platforms are indeed driving efforts to reduce problematic content as do consumers and businesses. 88% percent of people in the U.S. believe the Internet has been mostly good for them and 70% believe the Internet has been mostly good for society. Cyber attacks have continued to rise. These include state-sponsored attacks, large-scale data provider attacks, and monetary extortion attacks. E-commerce and online advertising E-commerce is now 15 percent of retail sales. Its growth has slowed — up 12.4 percent in Q1 compared with a year earlier — but still towers over growth in regular retail, which was just 2 percent in Q1. In online advertising, on comparing the amount of media time spent versus the amount of advertising dollars spent, mobile hit equilibrium in 2018 while desktop hit that equilibrium point in 2015. The Internet ads spending on an annual basis accelerated a little bit in 2018 up 22 percent. Most of the spending is still on Google and Facebook, but companies like Amazon and Twitter are getting a growing share. Some 62 percent of all digital display ad buying is for programmatic ads, which will continue to grow. According to the leading tech companies the internet average revenue has been decelerating on a quarterly basis of 20 percent in Q1. Google and Facebook still account for the majority of online ad revenue, but the growth of US advertising platforms like Amazon, Twitter, Snapchat, and Pinterest is outstripping the big players: Google’s ad revenue grew 1.4 times over the past nine quarters and Facebook’s grew 1.9 times, while the combined group of new players grew 2.6 times. Customer acquisition costs — the marketing spending necessary to attract each new customer — is going up. That’s unsustainable because in some cases it surpasses the long-term revenue those customers will bring. Meeker suggests cheaper ways to acquire customers, like free trials and unpaid tiers. Freemium business models Freemium business models are growing and scaling. Freemium businesses equals free user experience which enables more usage, engagement, social sharing and network effects. It also equals premium user experience which drives monetization and product innovation. Freemium business evolution started in gaming, evolving and emerging in consumer and enterprise. One of the important factors for this growth is cloud deployment revenue which grew about 58% year-over-year. Another enabler of freemium subscription business models is efficient digital payments which account for more than 50% of day-to-day transactions around the world. Data growth Internet trends indicate that a number of data plumbers are helping a lot of companies collect data, manage connections, and optimize data. In a survey of retail customers, 91% preferred brands that provided personalized offers and recommendations. 83% were willing to passively share data in exchange for personalized services and 74% were willing to actively share data in exchange for personalized experiences. Data volume and utilization is also evolving rapidly. Enterprise surpassed consumer in 2018 and cloud is overtaking both. More data is now stored in the cloud than on private enterprise servers or consumer devices. Jobs and Work Strong economic indicators, internet enabled services, and jobs are helping work. If we look at global GDP. China, the US and India are rising, but Europe is falling. Cross-border trade is at 29% of global GDP and has been growing for many years. Global relative unemployment concerns are very high outside the US and low in itself. Consumer confidence index is high and rising. Unemployment is at a 19-year low but job openings are at an all-time high and wages are rising. On-demand work is creating internet-enabled opportunities and efficiencies. There are 7 million on-demand workers up 22 percent year-on-year. Remote work is also creating internet enabled work opportunities and efficiency. Americans working remotely have risen from 5 percent versus 3 percent in 2000. Online education Education costs and student debt are rising in the US whereas post-secondary education enrollment is slowing. Online education enrollment is high across a diverse base of universities - public, private for-profit, and private not-for-profit. Top offline institutions are ramping their online offerings at a very rapid rate - most recently University of Pennsylvania, University of London, University of Michigan and UC Boulder. Google's growth in creating certificates for in-demand jobs is growing rapidly which they are doing in collaboration with Coursera. Immigration and Healthcare In the U.S. 60% of the most highly valued tech companies are founded by first or second generation Americans. They employed 1.9 million people last year. USA entitlements account for 61% of government spending versus 42% 30 years ago, and shows no signs of stopping. Healthcare is steadily digitizing, driven by consumers and the trends are very powerful. You can expect more telemedicine and on-demand consultations. For details and infographics, we recommend you to go through the slide deck of the Internet trends report. What Elon Musk and South African conservation can teach us about technology forecasting. Jim Balsillie on Data Governance Challenges and 6 Recommendations to tackle them Experts present the most pressing issues facing global lawmakers on citizens’ privacy, democracy and the rights to freedom of speech.

0
0
10903

Packt

02 Sep 2015

24 min read

Big Data

Packt

02 Sep 2015

24 min read

In this article by Henry Garner, author of the book Clojure for Data Science, we'll be working with a relatively modest dataset of only 100,000 records. This isn't big data (at 100 MB, it will fit comfortably in the memory of one machine), but it's large enough to demonstrate the common techniques of large-scale data processing. Using Hadoop (the popular framework for distributed computation) as its case study, this article will focus on how to scale algorithms to very large volumes of data through parallelism. Before we get to Hadoop and distributed data processing though, we'll see how some of the same principles that enable Hadoop to be effective at a very large scale can also be applied to data processing on a single machine, by taking advantage of the parallel capacity available in all modern computers. (For more resources related to this topic, see here.) The reducers library The count operation we implemented previously is a sequential algorithm. Each line is processed one at a time until the sequence is exhausted. But there is nothing about the operation that demands that it must be done in this way. We could split the number of lines into two sequences (ideally of roughly equal length) and reduce over each sequence independently. When we're done, we would just add together the total number of lines from each sequence to get the total number of lines in the file: If each Reduce ran on its own processing unit, then the two count operations would run in parallel. All the other things being equal, the algorithm would run twice as fast. This is one of the aims of the clojure.core.reducers library—to bring the benefit of parallelism to algorithms implemented on a single machine by taking advantage of multiple cores. Parallel folds with reducers The parallel implementation of reduce implemented by the reducers library is called fold. To make use of a fold, we have to supply a combiner function that will take the results of our reduced sequences (the partial row counts) and return the final result. Since our row counts are numbers, the combiner function is simply +. Reducers are a part of Clojure's standard library, they do not need to be added as an external dependency. The adjusted example, using clojure.core.reducers as r, looks like this: (defn ex-5-5 [] (->> (io/reader "data/soi.csv") (line-seq) (r/fold + (fn [i x] (inc i))))) The combiner function, +, has been included as the first argument to fold and our unchanged reduce function is supplied as the second argument. We no longer need to pass the initial value of zero—fold will get the initial value by calling the combiner function with no arguments. Our preceding example works because +, called with no arguments, already returns zero: (defn ex-5-6 [] (+)) ;; 0 To participate in folding then, it's important that the combiner function have two implementations: one with zero arguments that returns the identity value and another with two arguments that combines the arguments. Different folds will, of course, require different combiner functions and identity values. For example, the identity value for multiplication is 1. We can visualize the process of seeding the computation with an identity value, iteratively reducing over the sequence of xs and combining the reductions into an output value as a tree: There may be more than two reductions to combine, of course. The default implementation of fold will split the input collection into chunks of 512 elements. Our 166,000-element sequence will therefore generate 325 reductions to be combined. We're going to run out of page real estate quite quickly with a tree representation diagram, so let's visualize the process more schematically instead—as a two-step reduce and combine process. The first step performs a parallel reduce across all the chunks in the collection. The second step performs a serial reduce over the intermediate results to arrive at the final result: The preceding representation shows reduce over several sequences of xs, represented here as circles, into a series of outputs, represented here as squares. The squares are combined serially to produce the final result, represented by a star. Loading large files with iota Calling fold on a lazy sequence requires Clojure to realize the sequence into memory and then chunk the sequence into groups for parallel execution. For situations where the calculation performed on each row is small, the overhead involved in coordination outweighs the benefit of parallelism. We can improve the situation slightly by using a library called iota (https://github.com/thebusby/iota). The iota library loads files directly into the data structures suitable for folding over with reducers that can handle files larger than available memory by making use of memory-mapped files. With iota in the place of our line-seq function, our line count simply becomes: (defn ex-5-7 [] (->> (iota/seq "data/soi.csv") (r/fold + (fn [i x] (inc i))))) So far, we've just been working with the sequences of unformatted lines, but if we're going to do anything more than counting the rows, we'll want to parse them into a more useful data structure. This is another area in which Clojure's reducers can help make our code more efficient. Creating a reducers processing pipeline We already know that the file is comma-separated, so let's first create a function to turn each row into a vector of fields. All fields except the first two contain numeric data, so let's parse them into doubles while we're at it: (defn parse-double [x] (Double/parseDouble x)) (defn parse-line [line] (let [[text-fields double-fields] (->> (str/split line #",") (split-at 2))] (concat text-fields (map parse-double double-fields)))) We're using the reducers version of map to apply our parse-line function to each of the lines from the file in turn: (defn ex-5-8 [] (->> (iota/seq "data/soi.csv") (r/drop 1) (r/map parse-line) (r/take 1) (into []))) ;; [("01" "AL" 0.0 1.0 889920.0 490850.0 ...)] The final into function call converts the reducers' internal representation (a reducible collection) into a Clojure vector. The previous example should return a sequence of 77 fields, representing the first row of the file after the header. We're just dropping the column names at the moment, but it would be great if we could make use of these to return a map representation of each record, associating the column name with the field value. The keys of the map would be the column headings and the values would be the parsed fields. The clojure.core function zipmap will create a map out of two sequences—one for the keys and one for the values: (defn parse-columns [line] (->> (str/split line #",") (map keyword))) (defn ex-5-9 [] (let [data (iota/seq "data/soi.csv") column-names (parse-columns (first data))] (->> (r/drop 1 data) (r/map parse-line) (r/map (fn [fields] (zipmap column-names fields))) (r/take 1) (into [])))) This function returns a map representation of each row, a much more user-friendly data structure: [{:N2 1505430.0, :A19300 181519.0, :MARS4 256900.0 ...}] A great thing about Clojure's reducers is that in the preceding computation, calls to r/map, r/drop and r/take are composed into a reduction that will be performed in a single pass over the data. This becomes particularly valuable as the number of operations increases. Let's assume that we'd like to filter out zero ZIP codes. We could extend the reducers pipeline like this: (defn ex-5-10 [] (let [data (iota/seq "data/soi.csv") column-names (parse-columns (first data))] (->> (r/drop 1 data) (r/map parse-line) (r/map (fn [fields] (zipmap column-names fields))) (r/remove (fn [record] (zero? (:zipcode record)))) (r/take 1) (into [])))) The r/remove step is now also being run together with the r/map, r/drop and r/take calls. As the size of the data increases, it becomes increasingly important to avoid making multiple iterations over the data unnecessarily. Using Clojure's reducers ensures that our calculations are compiled into a single pass. Curried reductions with reducers To make the process clearer, we can create a curried version of each of our previous steps. To parse the lines, create a record from the fields and filter zero ZIP codes. The curried version of the function is a reduction waiting for a collection: (def line-formatter (r/map parse-line)) (defn record-formatter [column-names] (r/map (fn [fields] (zipmap column-names fields)))) (def remove-zero-zip (r/remove (fn [record] (zero? (:zipcode record))))) In each case, we're calling one of reducers' functions, but without providing a collection. The response is a curried version of the function that can be applied to the collection at a later time. The curried functions can be composed together into a single parse-file function using comp: (defn load-data [file] (let [data (iota/seq file) col-names (parse-columns (first data)) parse-file (comp remove-zero-zip (record-formatter col-names) line-formatter)] (parse-file (rest data)))) It's only when the parse-file function is called with a sequence that the pipeline is actually executed. Statistical folds with reducers With the data parsed, it's time to perform some descriptive statistics. Let's assume that we'd like to know the mean number of returns (column N1) submitted to the IRS by ZIP code. One way of doing this—the way we've done several times throughout the book—is by adding up the values and dividing it by the count. Our first attempt might look like this: (defn ex-5-11 [] (let [data (load-data "data/soi.csv") xs (into [] (r/map :N1 data))] (/ (reduce + xs) (count xs)))) ;; 853.37 While this works, it's comparatively slow. We iterate over the data once to create xs, a second time to calculate the sum, and a third time to calculate the count. The bigger our dataset gets, the larger the time penalty we'll pay. Ideally, we would be able to calculate the mean value in a single pass over the data, just like our parse-file function previously. It would be even better if we can perform it in parallel too. Associativity Before we proceed, it's useful to take a moment to reflect on why the following code wouldn't do what we want: (defn mean ([] 0) ([x y] (/ (+ x y) 2))) Our mean function is a function of two arities. Without arguments, it returns zero, the identity for the mean computation. With two arguments, it returns their mean: (defn ex-5-12 [] (->> (load-data "data/soi.csv") (r/map :N1) (r/fold mean))) ;; 930.54 The preceding example folds over the N1 data with our mean function and produces a different result from the one we obtained previously. If we could expand out the computation for the first three xs, we might see something like the following code: (mean (mean (mean 0 a) b) c) This is a bad idea, because the mean function is not associative. For an associative function, the following holds true: Addition is associative, but multiplication and division are not. So the mean function is not associative either. Contrast the mean function with the following simple addition: (+ 1 (+ 2 3)) This yields an identical result to: (+ (+ 1 2) 3) It doesn't matter how the arguments to + are partitioned. Associativity is an important property of functions used to reduce over a set of data because, by definition, the results of a previous calculation are treated as inputs to the next. The easiest way of converting the mean function into an associative function is to calculate the sum and the count separately. Since the sum and the count are associative, they can be calculated in parallel over the data. The mean function can be calculated simply by dividing one by the other. Multiple regression with gradient descent The normal equation uses matrix algebra to very quickly and efficiently arrive at the least squares estimates. Where all data fits in memory, this is a very convenient and concise equation. Where the data exceeds the memory available to a single machine however, the calculation becomes unwieldy. The reason for this is matrix inversion. The calculation of is not something that can be accomplished on a fold over the data—each cell in the output matrix depends on many others in the input matrix. These complex relationships require that the matrix be processed in a nonsequential way. An alternative approach to solve linear regression problems, and many other related machine learning problems, is a technique called gradient descent. Gradient descent reframes the problem as the solution to an iterative algorithm—one that does not calculate the answer in one very computationally intensive step, but rather converges towards the correct answer over a series of much smaller steps. The gradient descent update rule Gradient descent works by the iterative application of a function that moves the parameters in the direction of their optimum values. To apply this function, we need to know the gradient of the cost function with the current parameters. Calculating the formula for the gradient involves calculus that's beyond the scope of this book. Fortunately, the resulting formula isn't terribly difficult to interpret: is the partial derivative, or the gradient, of our cost function J(β) for the parameter at index j. Therefore, we can see that the gradient of the cost function with respect to the parameter at index j is equal to the difference between our prediction and the true value of y multiplied by the value of x at index j. Since we're seeking to descend the gradient, we want to subtract some proportion of the gradient from the current parameter values. Thus, at each step of gradient descent, we perform the following update: Here, := is the assigment operator and α is a factor called the learning rate. The learning rate controls how large an adjustment we wish make to the parameters at each iteration as a fraction of the gradient. If our prediction ŷ nearly matches the actual value of y, then there would be little need to change the parameters. In contrast, a larger error will result in a larger adjustment to the parameters. This rule is called the Widrow-Hoff learning rule or the Delta rule. The gradient descent learning rate As we've seen, gradient descent is an iterative algorithm. The learning rate, usually represented by α, dictates the speed at which the gradient descent converges to the final answer. If the learning rate is too small, convergence will happen very slowly. If it is too large, gradient descent will not find values close to the optimum and may even diverge from the correct answer: In the preceding chart, a small learning rate leads to a show convergence over many iterations of the algorithm. While the algorithm does reach the minimum, it does so over many more steps than is ideal and, therefore, may take considerable time. By contrast, in following diagram, we can see the effect of a learning rate that is too large. The parameter estimates are changed so significantly between iterations that they actually overshoot the optimum values and diverge from the minimum value: The gradient descent algorithm requires us to iterate repeatedly over our dataset. With the correct version of alpha, each iteration should successively yield better approximations of the ideal parameters. We can choose to terminate the algorithm when either the change between iterations is very small or after a predetermined number of iterations. Feature scaling As more features are added to the linear model, it is important to scale features appropriately. Gradient descent will not perform very well if the features have radically different scales, since it won't be possible to pick a learning rate to suit them all. A simple scaling we can perform is to subtract the mean value from each of the values and divide it by the standard-deviation. This will tend to produce values with zero mean that generally vary between -3 and 3: ( defn feature-scales [features] (->> (prepare-data) (t/map #(select-keys % features)) (t/facet) (t/fuse {:mean (m/mean) :sd (m/standard-deviation)}))) The feature-factors function in the preceding code uses t/facet to calculate the mean value and standard deviation of all the input features: (defn ex-5-24 [] (let [data (iota/seq "data/soi.csv") features [:A02300 :A00200 :AGI_STUB :NUMDEP :MARS2]] (->> (feature-scales features) (t/tesser (chunks data))))) ;; {:MARS2 {:sd 533.4496892658647, :mean 317.0412009748016}...} If you run the preceding example, you'll see the different means and standard deviations returned by the feature-scales function. Since our feature scales and input records are represented as maps, we can perform the scale across all the features at once using Clojure's merge-with function: (defn scale-features [factors] (let [f (fn [x {:keys [mean sd]}] (/ (- x mean) sd))] (fn [x] (merge-with f x factors)))) Likewise, we can perform the all-important reversal with unscale-features: (defn unscale-features [factors] (let [f (fn [x {:keys [mean sd]}] (+ (* x sd) mean))] (fn [x] (merge-with f x factors)))) Let's scale our features and take a look at the very first feature. Tesser won't allow us to execute a fold without a reduce, so we'll temporarily revert to using Clojure's reducers: (defn ex-5-25 [] (let [data (iota/seq "data/soi.csv") features [:A02300 :A00200 :AGI_STUB :NUMDEP :MARS2] factors (->> (feature-scales features) (t/tesser (chunks data)))] (->> (load-data "data/soi.csv") (r/map #(select-keys % features )) (r/map (scale-features factors)) (into []) (first)))) ;; {:MARS2 -0.14837567114357617, :NUMDEP 0.30617757526890155, ;; :AGI_STUB -0.714280814223704, :A00200 -0.5894942801950217, ;; :A02300 0.031741856083514465} This simple step will help gradient descent perform optimally on our data. Feature extraction Although we've used maps to represent our input data in this article, it's going to be more convenient when running gradient descent to represent our features as a matrix. Let's write a function to transform our input data into a map of xs and y. The y axis will be a scalar response value and xs will be a matrix of scaled feature values. We're adding a bias term to the returned matrix of features: (defn feature-matrix [record features] (let [xs (map #(% record) features)] (i/matrix (cons 1 xs)))) (defn extract-features [fy features] (fn [record] {:y (fy record) :xs (feature-matrix record features)})) Our feature-matrix function simply accepts an input of a record and the features to convert into a matrix. We call this from within extract-features, which returns a function that we can call on each input record: (defn ex-5-26 [] (let [data (iota/seq "data/soi.csv") features [:A02300 :A00200 :AGI_STUB :NUMDEP :MARS2] factors (->> (feature-scales features) (t/tesser (chunks data)))] (->> (load-data "data/soi.csv") (r/map (scale-features factors)) (r/map (extract-features :A02300 features)) (into []) (first)))) ;; {:y 433.0, :xs A 5x1 matrix ;; ------------- ;; 1.00e+00 ;; -5.89e-01 ;; -7.14e-01 ;; 3.06e-01 ;; -1.48e-01 ;; } The preceding example shows the data converted into a format suitable to perform gradient descent: a map containing the y response variable and a matrix of values, including the bias term. Applying a single step of gradient descent The objective of calculating the cost is to determine the amount by which to adjust each of the coefficients. Once we've calculated the average cost, as we did previously, we need to update the estimate of our coefficients β. Together, these steps represent a single iteration of gradient descent: We can return the updated coefficients in a post-combiner step that makes use of the average cost, the value of alpha, and the previous coefficients. Let's create a utility function update-coefficients, which will receive the coefficients and alpha and return a function that will calculate the new coefficients, given a total model cost: (defn update-coefficients [coefs alpha] (fn [cost] (->> (i/mult cost alpha) (i/minus coefs)))) With the preceding function in place, we have everything we need to package up a batch gradient descent update rule: (defn gradient-descent-fold [{:keys [fy features factors coefs alpha]}] (let [zeros-matrix (i/matrix 0 (count features) 1)] (->> (prepare-data) (t/map (scale-features factors)) (t/map (extract-features fy features)) (t/map (calculate-error (i/trans coefs))) (t/fold (matrix-mean (inc (count features)) 1)) (t/post-combine (update-coefficients coefs alpha))))) (defn ex-5-31 [] (let [features [:A00200 :AGI_STUB :NUMDEP :MARS2] fcount (inc (count features)) coefs (vec (replicate fcount 0)) data (chunks (iota/seq "data/soi.csv")) factors (->> (feature-scales features) (t/tesser data)) options {:fy :A02300 :features features :factors factors :coefs coefs :alpha 0.1}] (->> (gradient-descent-fold options) (t/tesser data)))) ;; A 6x1 matrix ;; ------------- ;; -4.20e+02 ;; -1.38e+06 ;; -5.06e+07 ;; -9.53e+02 ;; -1.42e+06 ;; -4.86e+05 The resulting matrix represents the values of the coefficients after the first iteration of gradient descent. Running iterative gradient descent Gradient descent is an iterative algorithm, and we will usually need to run it many times to convergence. With a large dataset, this can be very time-consuming. To save time, we've included a random sample of soi.csv in the data directory called soi-sample.csv. The smaller size allows us to run iterative gradient descent in a reasonable timescale. The following code runs gradient descent for 100 iterations, plotting the values of the parameters between each iteration on an xy-plot: (defn descend [options data] (fn [coefs] (->> (gradient-descent-fold (assoc options :coefs coefs)) (t/tesser data)))) (defn ex-5-32 [] (let [features [:A00200 :AGI_STUB :NUMDEP :MARS2] fcount (inc (count features)) coefs (vec (replicate fcount 0)) data (chunks (iota/seq "data/soi-sample.csv")) factors (->> (feature-scales features) (t/tesser data)) options {:fy :A02300 :features features :factors factors :coefs coefs :alpha 0.1} iterations 100 xs (range iterations) ys (->> (iterate (descend options data) coefs) (take iterations))] (-> (c/xy-plot xs (map first ys) :x-label "Iterations" :y-label "Coefficient") (c/add-lines xs (map second ys)) (c/add-lines xs (map #(nth % 2) ys)) (c/add-lines xs (map #(nth % 3) ys)) (c/add-lines xs (map #(nth % 4) ys)) (i/view)))) If you run the example, you should see a chart similar to the following: In the preceding chart, you can see how the parameters converge to relatively stable the values over the course of 100 iterations. Scaling gradient descent with Hadoop The length of time each iteration of batch gradient descent takes to run is determined by the size of your data and by how many processors your computer has. Although several chunks of data are processed in parallel, the dataset is large and the processors are finite. We've achieved a speed gain by performing calculations in parallel, but if we double the size of the dataset, the runtime will double as well. Hadoop is one of several systems that has emerged in the last decade which aims to parallelize work that exceeds the capabilities of a single machine. Rather than running code across multiple processors, Hadoop takes care of running a calculation across many servers. In fact, Hadoop clusters can, and some do, consist of many thousands of servers. Hadoop consists of two primary subsystems— the Hadoop Distributed File System (HDFS)—and the job processing system, MapReduce. HDFS stores files in chunks. A given file may be composed of many chunks and chunks are often replicated across many servers. In this way, Hadoop can store quantities of data much too large for any single server and, through replication, ensure that the data is stored reliably in the event of hardware failure too. As the name implies, the MapReduce programming model is built around the concept of map and reduce steps. Each job is composed of at least one map step and may optionally specify a reduce step. An entire job may consist of several map and reduce steps chained together. In the respect that reduce steps are optional, Hadoop has a slightly more flexible approach to distributed calculation than Tesser. Gradient descent on Hadoop with Tesser and Parkour Tesser's Hadoop capabilities are available in the tesser.hadoop namespace, which we're including as h. The primary public API function in the Hadoop namespace is h/fold. The fold function expects to receive at least four arguments, representing the configuration of the Hadoop job, the input file we want to process, a working directory for Hadoop to store its intermediate files, and the fold we want to run, referenced as a Clojure var. Any additional arguments supplied will be passed as arguments to the fold when it is executed. The reason for using a var to represent our fold is that the function call initiating the fold may happen on a completely different computer than the one that actually executes it. In a distributed setting, the var and arguments must entirely specify the behavior of the function. We can't, in general, rely on other mutable local state (for example, the value of an atom, or the value of variables closing over the function) to provide any additional context. Parkour distributed sources and sinks The data which we want our Hadoop job to process may exist on multiple machines too, stored distributed in chunks on HDFS. Tesser makes use of a library called Parkour (https://github.com/damballa/parkour/) to handle accessing potentially distributed data sources. Although Hadoop is designed to be run and distributed across many servers, it can also run in local mode. Local mode is suitable for testing and enables us to interact with the local filesystem as if it were HDFS. Another namespace we'll be using from Parkour is the parkour.conf namespace. This will allow us to create a default Hadoop configuration and operate it in local mode: (defn ex-5-33 [] (->> (text/dseq "data/soi.csv") (r/take 2) (into []))) In the preceding example, we use Parkour's text/dseq function to create a representation of the IRS input data. The return value implements Clojure's reducers protocol, so we can use r/take on the result. Running a feature scale fold with Hadoop Hadoop needs a location to write its temporary files while working on a task, and will complain if we try to overwrite an existing directory. Since we'll be executing several jobs over the course of the next few examples, let's create a little utility function that returns a new file with a randomly-generated name. (defn rand-file [path] (io/file path (str (long (rand 0x100000000))))) (defn ex-5-34 [] (let [conf (conf/ig) input (text/dseq "data/soi.csv") workdir (rand-file "tmp") features [:A00200 :AGI_STUB :NUMDEP :MARS2]] (h/fold conf input workdir #'feature-scales features))) Parkour provides a default Hadoop configuration object with the shorthand (conf/ig). This will return an empty configuration. The default value is enough, we don't need to supply any custom configuration. All of our Hadoop jobs will write their temporary files to a random directory inside the project's tmp directory. Remember to delete this folder later, if you're concerned about preserving disk space. If you run the preceding example now, you should get an output similar to the following: ;; {:MARS2 317.0412009748016, :NUMDEP 581.8504423822615, ;; :AGI_STUB 3.499939975269811, :A00200 37290.58880658831} Although the return value is identical to the values we got previously, we're now making use of Hadoop behind the scenes to process our data. In spite of this, notice that Tesser will return the response from our fold as a single Clojure data structure. Running gradient descent with Hadoop Since tesser.hadoop folds return Clojure data structures just like tesser.core folds, defining a gradient descent function that makes use of our scaled features is very simple: (defn hadoop-gradient-descent [conf input-file workdir] (let [features [:A00200 :AGI_STUB :NUMDEP :MARS2] fcount (inc (count features)) coefs (vec (replicate fcount 0)) input (text/dseq input-file) options {:column-names column-names :features features :coefs coefs :fy :A02300 :alpha 1e-3} factors (h/fold conf input (rand-file workdir) #'feature-scales features) descend (fn [coefs] (h/fold conf input (rand-file workdir) #'gradient-descent-fold (merge options {:coefs coefs :factors factors})))] (take 5 (iterate descend coefs)))) The preceding code defines a hadoop-gradient-descent function that iterates a descend function 5 times. Each iteration of descend calculates the improved coefficients based on the gradient-descent-fold function. The final return value is a vector of coefficients after 5 iterations of a gradient descent. We run the job on the full IRS data in the following example: ( defn ex-5-35 [] (let [workdir "tmp" out-file (rand-file workdir)] (hadoop-gradient-descent (conf/ig) "data/soi.csv" workdir))) After several iterations, you should see an output similar to the following: ;; ([0 0 0 0 0] ;; (20.9839310796048 46.87214911003046 -7.363493937722712 ;; 101.46736841329326 55.67860863427868) ;; (40.918665605227744 56.55169901254631 -13.771345753228694 ;; 162.1908841131747 81.23969785586247) ;; (59.85666340457121 50.559130068258995 -19.463888245285332 ;; 202.32407094149158 92.77424653758085) ;; (77.8477613139478 38.67088624825574 -24.585818946408523 ;; 231.42399118694212 97.75201693843269)) We've seen how we're able to calculate gradient descent using distributed techniques locally. Now, let's see how we can run this on a cluster of our own. Summary In this article, we learned some of the fundamental techniques of distributed data processing and saw how the functions used locally for data processing, map and reduce, are powerful ways of processing even very large quantities of data. We learned how Hadoop can scale unbounded by the capabilities of any single server by running functions on smaller subsets of the data whose outputs are themselves combined to finally produce a result. Once you understand the tradeoffs, this "divide and conquer" approach toward processing data is a simple and very general way of analyzing data on a large scale. We saw both the power and limitations of simple folds to process data using both Clojure's reducers and Tesser. We've also begun exploring how Parkour exposes more of Hadoop's underlying capabilities. Resources for Article: Further resources on this subject: Supervised learning[article] Machine Learning[article] Why Big Data in the Financial Sector? [article]

0
0
10892

article-image-rotation-forest-classifier-ensemble-based-feature-extraction

Packt

28 Oct 2015

16 min read

Rotation Forest - A Classifier Ensemble Based on Feature Extraction

Packt

28 Oct 2015

16 min read

In this article by Gopi Subramanian author of the book Python Data Science Cookbook you will learn bagging methods based on decision tree-based algorithms are very popular among the data science community. Rotation Forest The claim to fame of most of these methods is that they need zero data preparation as compared to the other methods, can obtain very good results, and can be provided as a black box of tools in the hands of software engineers. By design, bagging lends itself nicely to parallelization. Hence, these methods can be easily applied on a very large dataset in a cluster environment. The decision tree algorithms split the input data into various regions at each level of the tree. Thus, they perform implicit feature selection. Feature selection is one of the most important tasks in building a good model. By providing implicit feature selection, trees are at an advantageous position as compared to other techniques. Hence, bagging with decision trees comes with this advantage. Almost no data preparation is needed for decision trees. For example, consider the scaling of attributes. Attribute scaling has no impact on the structure of the trees. The missing values also do not affect decision trees. The effect of outliers is very minimal on a decision tree. We don’t have to do explicit feature transformations to accommodate feature interactions. One of the major complaints against tree-based methods is the difficulty with pruning the trees to avoid overfitting. Big trees tend to also fit the noise present in the underlying data and hence lead to a low bias and high variance. However, when we grow a lot of trees and the final prediction is an average of the output of all the trees in the ensemble, we avoid these problems. In this article, we will see a powerful tree-based ensemble method called rotation forest. A typical random forest requires a large number of trees to be a part of its ensemble in order to achieve good performance. Rotation forest can achieve similar or better performance with less number of trees. Additionally, the authors of this algorithm claim that the underlying estimator can be anything other than a tree. This way, it is projected as a new framework to build an ensemble similar to gradient boosting. (For more resources related to this topic, see here.) The algorithm Random forest and bagging gives impressive results with very large ensembles; having a large number of estimators results in the improvement of the accuracy of these methods. On the contrary, rotation forest is designed to work with a smaller number of ensembles. Let's write down the steps involved in building a rotation forest. The number of trees required in the forest is typically specified by the user. Let T be the number of trees required to be built. We start with iterating from one through T, that is, we build T trees. For each tree T, perform the following steps: Split the attributes in the training set into K nonoverlapping subsets of equal size. We have K datasets, each with K attributes. For each of the K datasets, we proceed to do the following: Bootstrap 75% of the data from each K dataset and use the bootstrapped sample for further steps. Run a principal component analysis on the ith subset in K. Retain all the principal components. For every feature j in the Kth subset, we have a principle component, a. Let's denote it as aij, where it’s the principal component for the jth attribute in the ith subset. Store the principal components for the subset. Create a rotation matrix of size, n X n, where n is the total number of attributes. Arrange the principal component in the matrix such that the components match the position of the feature in the original training dataset. Project the training dataset on the rotation matrix using the matrix multiplication. Build a decision tree with the projected dataset. Store the tree and rotation matrix. A quick note about PCA: PCA is an unsupervised method. In multivariate problems, PCA is used to reduce the dimension of the data with minimal information loss or, in other words, retain maximum variation in the data. In PCA, we find the directions in the data with the most variation, that is, the eigenvectors corresponding to the largest eigenvalues of the covariance matrix and project the data onto these directions. With a dataset (n x m) with n instances and m dimensions, PCA projects it onto a smaller subspace (n x d), where d << m. A point to note is that PCA is computationally very expensive. Programming rotation forest in Python Now let's write a Python code to implement rotation forest. We will proceed to test it on a classification problem. We will generate some classification dataset to demonstrate rotation forest. To our knowledge, there is no Python implementation available for rotation forest. We will leverage scikit-learns’s implementation of the decision tree classifier and use the train_test_split method for the bootstrapping. Refer to the following link to learn more about scikit-learn: http://scikit-learn.org/stable/ First write the necessary code to implement rotation forest and apply it on a classification problem. We will start with loading all the necessary libraries. Let's leverage the make_classification method from the sklearn.dataset module to generate the training data. We will follow it with a method to select a random subset of attributes called gen_random_subset: from sklearn.datasets import make_classification from sklearn.metrics import classification_report from sklearn.cross_validation import train_test_split from sklearn.decomposition import PCA from sklearn.tree import DecisionTreeClassifier import numpy as np def get_data(): """ Make a sample classification dataset Returns : Independent variable y, dependent variable x """ no_features = 50 redundant_features = int(0.1*no_features) informative_features = int(0.6*no_features) repeated_features = int(0.1*no_features) x,y = make_classification(n_samples=500,n_features=no_features,flip_y=0.03, n_informative = informative_features, n_redundant = redundant_features ,n_repeated = repeated_features,random_state=7) return x,y def get_random_subset(iterable,k): subsets = [] iteration = 0 np.random.shuffle(iterable) subset = 0 limit = len(iterable)/k while iteration < limit: if k <= len(iterable): subset = k else: subset = len(iterable) subsets.append(iterable[-subset:]) del iterable[-subset:] iteration+=1 return subsets We will write the build_rotationtree_model function, where we will build fully-grown trees and proceed to evaluate the forest’s performance using the model_worth function: def build_rotationtree_model(x_train,y_train,d,k): models = [] r_matrices = [] feature_subsets = [] for i in range(d): x,_,_,_ = train_test_split(x_train,y_train,test_size=0.3,random_state=7) # Features ids feature_index = range(x.shape[1]) # Get subsets of features random_k_subset = get_random_subset(feature_index,k) feature_subsets.append(random_k_subset) # Rotation matrix R_matrix = np.zeros((x.shape[1],x.shape[1]),dtype=float) for each_subset in random_k_subset: pca = PCA() x_subset = x[:,each_subset] pca.fit(x_subset) for ii in range(0,len(pca.components_)): for jj in range(0,len(pca.components_)): R_matrix[each_subset[ii],each_subset[jj]] = pca.components_[ii,jj] x_transformed = x_train.dot(R_matrix) model = DecisionTreeClassifier() model.fit(x_transformed,y_train) models.append(model) r_matrices.append(R_matrix) return models,r_matrices,feature_subsets def model_worth(models,r_matrices,x,y): predicted_ys = [] for i,model in enumerate(models): x_mod = x.dot(r_matrices[i]) predicted_y = model.predict(x_mod) predicted_ys.append(predicted_y) predicted_matrix = np.asmatrix(predicted_ys) final_prediction = [] for i in range(len(y)): pred_from_all_models = np.ravel(predicted_matrix[:,i]) non_zero_pred = np.nonzero(pred_from_all_models)[0] is_one = len(non_zero_pred) > len(models)/2 final_prediction.append(is_one) print classification_report(y, final_prediction) Finally, we will write a main function used to invoke the functions that we have defined: if __name__ == "__main__": x,y = get_data() # plot_data(x,y) # Divide the data into Train, dev and test x_train,x_test_all,y_train,y_test_all = train_test_split(x,y,test_size = 0.3,random_state=9) x_dev,x_test,y_dev,y_test = train_test_split(x_test_all,y_test_all,test_size=0.3,random_state=9) # Build a bag of models models,r_matrices,features = build_rotationtree_model(x_train,y_train,25,5) model_worth(models,r_matrices,x_train,y_train) model_worth(models,r_matrices,x_dev,y_dev) Walking through our code Let's start with our main function. We will invoke get_data to get our predictor attributes in the response attributes. In get_data, we will leverage the make_classification dataset to generate our training data for our recipe: def get_data(): """ Make a sample classification dataset Returns : Independent variable y, dependent variable x """ no_features = 30 redundant_features = int(0.1*no_features) informative_features = int(0.6*no_features) repeated_features = int(0.1*no_features) x,y = make_classification(n_samples=500,n_features=no_features,flip_y=0.03, n_informative = informative_features, n_redundant = redundant_features ,n_repeated = repeated_features,random_state=7) return x,y Let's look at the parameters passed to the make_classification method. The first parameter is the number of instances required; in this case, we say we need 500 instances. The second parameter is about how many attributes per instance are required. We say that we need 30. The third parameter, flip_y, randomly interchanges 3% of the instances. This is done to introduce some noise in our data. The next parameter is how many of these 30 features should be informative enough to be used in our classification. We have specified that 60% of our features, that is, 18 out of 30 should be informative. The next parameter is about redundant features. These are generated as a linear combination of the informative features in order to introduce correlation among the features. Finally, the repeated features are duplicate features, which are drawn randomly from both the informative and redundant features. Let's split the data into training and testing sets using train_test_split. We will reserve 30% of our data to test: Divide the data into Train, dev and test x_train,x_test_all,y_train,y_test_all = train_test_split(x,y,test_size = 0.3,random_state=9) Once again, we will leverage train_test_split to split our test data into dev and test sets: x_dev,x_test,y_dev,y_test = train_test_split(x_test_all,y_test_all,test_size=0.3,random_state=9) With the data divided to build, evaluate, and test the model, we will proceed to build our models: models,r_matrices,features = build_rotationtree_model(x_train,y_train,25,5) We will invoke the build_rotationtree_model function to build our rotation forest. We will pass our training data, predictor, x_train and response variable, y_train, the total number of trees to be built—25 in this case—and finally, the subset of features to be used—5 in this case. Let's jump into this function: models = [] r_matrices = [] feature_subsets = [] We will begin with declaring three lists to store each of the decision tree, rotation matrix for this tree, and subset of features used in this iteration. We will proceed to build each tree in our ensemble. As a first order of business, we will bootstrap to retain only 75% of the data: x,_,_,_ = train_test_split(x_train,y_train,test_size=0.3,random_state=7) We will leverage the train_test_split function from scikit-learn for the bootstrapping. We will then decide the feature subsets: # Features ids feature_index = range(x.shape[1]) # Get subsets of features random_k_subset = get_random_subset(feature_index,k) feature_subsets.append(random_k_subset) The get_random_subset function takes the feature index and number of subsets required as parameters and returns K subsets. In this function, we will shuffle the feature index. The feature index is an array of numbers starting from zero and ending with the number of features in our training set: np.random.shuffle(iterable) Let's say that we have ten features and our k value is five, indicating that we need subsets with five nonoverlapping feature indices and we need to do two iterations. We will store the number of iterations needed in the limit variable: limit = len(iterable)/k while iteration < limit: if k <= len(iterable): subset = k else: subset = len(iterable) iteration+=1 If our required subset is less than the total number of attributes, we can proceed to use the first k entries in our iterable. As we shuffled our iterables, we will be returning different volumes at different times: subsets.append(iterable[-subset:]) On selecting a subset, we will remove it from the iterable as we need nonoverlapping sets: del iterable[-subset:] With all the subsets ready, we will declare our rotation matrix: del iterable[-subset:] With all the subsets ready, we will declare our rotation matrix: # Rotation matrix R_matrix = np.zeros((x.shape[1],x.shape[1]),dtype=float) As you can see, our rotation matrix is of size, n x n, where is the number of attributes in our dataset. You can see that we have used the shape attribute to declare this matrix filled with zeros: for each_subset in random_k_subset: pca = PCA() x_subset = x[:,each_subset] pca.fit(x_subset) For each of the K subsets of data having only K features, we will proceed to do a principal component analysis. We will fill our rotation matrix with the component values: for ii in range(0,len(pca.components_)): for jj in range(0,len(pca.components_)): R_matrix[each_subset[ii],each_subset[jj]] = pca.components_[ii,jj] For example, let's say that we have three attributes in our subset in a total of six attributes. For illustration, let's say that our subsets are as follows: 2,4,6 and 1,3,5 Our rotation matrix, R, is of size, 6 x 6. Assume that we want to fill the rotation matrix for the first subset of features. We will have three principal components, one each for 2,4, and 6 of size, 1 x 3. The output of PCA from scikit-learn is a matrix of size components X features. We will go through each component value in the for loop. At the first run, our feature of interest is two, and the cell (0,0) in the component matrix output from PCA gives the value of contribution of feature two to component one. We have to find the right place in the rotation matrix for this value. We will use the index from the component matrices, ii and jj, with the subset list to get the right place in the rotation matrix: R_matrix[each_subset[ii],each_subset[jj]] = pca.components_[ii,jj] The each_subset[0] and each_subset[0] will put us in cell (2,2) in the rotation matrix. As we go through the loop, the next component value in cell (0,1) in the component matrix will be placed in cell (2,4) in the rotation matrix and the last one in cell (2,6) in the rotation matrix. This is done for all the attributes in the first subset. Let's go to the second subset; here the first attribute is one. The cell (0,0) of the component matrix corresponds to the cell (1,1) in the rotation matrix. Proceeding this way, you will notice that the attribute component values are arranged in the same order as the attributes themselves. With our rotation matrix ready, let's project our input onto the rotation matrix: x_transformed = x_train.dot(R_matrix) It’s time now to fit our decision tree: model = DecisionTreeClassifier() model.fit(x_transformed,y_train) Finally, we will store our models and the corresponding rotation matrices: models.append(model) r_matrices.append(R_matrix) With our model built, let's proceed to see how good our model is in both the train and dev datasets using the model_worth function: model_worth(models,r_matrices,x_train,y_train) model_worth(models,r_matrices,x_dev,y_dev) Let's see our model_worth function: for i,model in enumerate(models): x_mod = x.dot(r_matrices[i]) predicted_y = model.predict(x_mod) predicted_ys.append(predicted_y) In this function, perform prediction using each of the trees that we built. However, before doing the prediction, we will project our input using the rotation matrix. We will store all our prediction output in a list called predicted_ys. Let's say that we have 100 instances to predict and ten models in our tree; for each instance, we have ten predictions. We will store these as a matrix for convenience: predicted_matrix = np.asmatrix(predicted_ys) Now, we will proceed to give a final classification for each of our input records: final_prediction = [] for i in range(len(y)): pred_from_all_models = np.ravel(predicted_matrix[:,i]) non_zero_pred = np.nonzero(pred_from_all_models)[0] is_one = len(non_zero_pred) > len(models)/2 final_prediction.append(is_one) We will store our final prediction in a list called final_prediction. We will go through each of the predictions for our instance. Let's say that we are in the first instance (i=0 in our for loop); pred_from_all_models stores the output from all the trees in our model. It’s an array of zeros and ones indicating which class has the model classified in this instance. We will make another array out of it, non_zero_pred, which has only those entries from the parent arrays that are non-zero. Finally, if the length of this non-zero array is greater than half the number of models that we have, we say that our final prediction is one for the instance of interest. What we have accomplished here is the classic voting scheme. Let's look at how good our models are now by calling a classification report: print classification_report(y, final_prediction) Here is how good our model has performed on the training set: Let's see how good our model performance is on the dev dataset: References More information about rotation forest can be obtained from the following paper:. Rotation Forest: A New Classifier Ensemble Method Juan J. Rodrı´guez, Member, IEEE Computer Society, Ludmila I. Kuncheva, Member, IEEE, and Carlos J. Alonso The paper also claims that when rotation forest was compared to bagging, AdBoost, and random forest on 33 datasets, rotation forest outperformed all the other three algorithms. Similar to gradient boosting, authors of the paper claim that rotation forest is an overall framework and the underlying ensemble is not necessary to be a decision tree. Work is in progress in testing other algorithms such as Naïve Bayes, Neural Networks, and similar others. Summary In this article will learnt the bagging methods based on decision tree-based algorithms are very popular among the data science community. Resources for Article: Further resources on this subject: Mobile Phone Forensics – A First Step into Android Forensics [article] Develop a Digital Clock [article] Monitoring Physical Network Bandwidth Using OpenStack Ceilometer [article]

0
0
10885

Packt

18 Nov 2014

17 min read

The plot function

Packt

18 Nov 2014

17 min read

0
0
10842

Packt

06 Sep 2013

4 min read

Regression models in Weka

Packt

06 Sep 2013

4 min read

(For more resources related to this topic, see here.) Getting ready Let's look at an example of a house price-based regression model, and create some real data to examine. These are actual numbers from houses for sale, and we will be trying to find the value of a house we are supposed to sell: Size (m2) Land (m2) Rooms Granite Extra bathroom Price 1076 2801 6 0 0 €324.500,00 990 3067 5 1 1 €466.000,00 1229 3094 5 0 1 €425.900,00 731 4315 4 1 0 €387.120,00 671 2926 4 0 1 €312.100,00 1078 6094 6 1 1 €603.000,00 909 2854 5 0 1 €383.400,00 975 2947 5 1 1 ?? To load files in Weka, we have to put the table in the ARFF file format and save it as house.arff. Make sure the attributes are numeric, as shown here: @RELATION house@ATTRIBUTE size NUMERIC@ATTRIBUTE land NUMERIC@ATTRIBUTE rooms NUMERIC@ATTRIBUTE granite NUMERIC@ATTRIBUTE extra_bathroom NUMERIC@ATTRIBUTE price NUMERIC@DATA1076,2801,6,0,0,324500990,3067,5,1,1,4660001229,3094,5,0,1,425900731,4315,4,1,0,387120671,2926,4,0,1,3121001078,6094,6,1,1,603000909,2854,5,0,1,383400975,2947,5,1,1,? How to do it… Use the following snippet: import java.io.BufferedReader;import java.io.FileReader;import weka.core.Instance;import weka.core.Instances;import weka.classifiers.functions.LinearRegression;public class Regression{public static void main(String args[]) throws Exception{//load dataInstances data = new Instances(new BufferedReader(newFileReader("dataset/house.arff")));data.setClassIndex(data.numAttributes() - 1);//build modelLinearRegression model = new LinearRegression();model.buildClassifier(data); //the last instance with missingclass is not usedSystem.out.println(model);//classify the last instanceInstance myHouse = data.lastInstance();double price = model.classifyInstance(myHouse);System.out.println("My house ("+myHouse+"): "+price);}} Here is the output: Linear Regression Modelprice =195.2035 * size +38.9694 * land +76218.4642 * granite +73947.2118 * extra_bathroom +2681.136My house (975,2947,5,1,1,?): 458013.16703945777 How it works… Import a basic regression model named weka.classifiers.functions.LinearRegression: import java.io.BufferedReader;import java.io.FileReader;import weka.core.Instance;import weka.core.Instances;import weka.classifiers.functions.LinearRegression; Load the house dataset: Instances data = new Instances(new BufferedReader(newFileReader("dataset/house.arff")));data.setClassIndex(data.numAttributes() - 1); Initialize and build a regression model. Note, that the last instance is not used for building the model since the class value is missing: LinearRegression model = new LinearRegression();model.buildClassifier(data); Output the model: System.out.println(model); Use the model to predict the price of the last instance in the dataset: Instance myHouse = data.lastInstance();double price = model.classifyInstance(myHouse);System.out.println("My house ("+myHouse+"): "+price); There’s more This section lists some additional algorithms. Other regression algorithms There is a wide variety of implemented regression algorithms one can use in Weka: weka.classifiers.rules.ZeroR: The class for building and using an 0-R classifier. Predicts the mean (for a numeric class) or the mode (for a nominal class) and it is considered as a baseline; that is, if your classifier's performance is worse than average value predictor, it is not worth considering it. weka.classifiers.trees.REPTree: The fast decision tree learner. Builds a decision/regression tree using information gain/variance and prunes it using reduced-error pruning (with backfitting). It only sorts values for numeric attributes once. Missing values are dealt with by splitting the corresponding instances into pieces (that is, as in C4.5). weka.classifiers.functions.SMOreg: SMOreg implements the support vector machine for regression. The parameters can be learned using various algorithms. The algorithm is selected by setting the RegOptimizer. The most popular algorithm (RegSMOImproved) is due to Shevade, Keerthi, and others, and this is the default RegOptimizer. weka.classifiers.functions.MultilayerPerceptron: A classifier that uses backpropagation to classify instances. This network can be built by hand, or created by an algorithm, or both. The network can also be monitored and modified during training time. The nodes in this network are all sigmoid (except for when the class is numeric in which case the output nodes become unthresholded linear units). weka.classifiers.functions.GaussianProcesses: Implements Gaussian Processes for regression without hyperparameter-tuning. Summary We learned how to use models that predict a value of numerical class, in contrast to classification, which predicts the value of a nominal class. Given a set of attributes, the regression builds a model, usually an equation that is used to compute the predicted class value. Resources for Article: Further resources on this subject: Java in Oracle Database [Article] Installing and Setting up JavaFX for NetBeans and Eclipse IDE [Article] Getting Started with JavaFX [Article]

0
0
10835

article-image-entity-framework-code-first-accessing-database-views-and-stored-procedures

Packt

18 Mar 2015

15 min read

Entity Framework Code-First: Accessing Database Views and Stored Procedures

Packt

18 Mar 2015

15 min read

0
0
10826

article-image-dimensionality-reduction-principal-component-analysis

Packt

18 Aug 2015

12 min read

Dimensionality Reduction with Principal Component Analysis

Packt

18 Aug 2015

12 min read

In this article by Eric Mayor, author of the book Learning Predictive Analytics with R, we will be discussing how to use PCA in R. Nowadays, the access to data is easier and cheaper than ever before. This leads to a proliferation of data in organizations' data warehouses and on the Internet. Analyzing this data is not trivial, as the quantity often makes the analysis difficult or unpractical. For instance, the data is often more abundant than available memory on the machines. The available computational power is also often not enough to analyze the data in a reasonable time frame. One solution is to have recourse to technologies that deal with high dimensionality in data (big data). These solutions typically use the memory and computing power of several machines for analysis (computer clusters). However, most organizations do not have such infrastructure. Therefore, a more practical solution is to reduce the dimensionality of the data while keeping the essential of the information intact. (For more resources related to this topic, see here.) Another reason to reduce dimensionality is that, in some cases, there are more attributes than observations. If some scientists were to store the genome of all inhabitants of Europe and the United States, the number of cases (approximately 1 billion) would be much less than the 3 billion base pairs in the human DNA. Most analyses do not work well or not at all when observations are less than attributes. Confronted with this problem, the data analysts might select groups of attributes, which go together (like height and weight, for instance) according to their domain knowledge, and reduce the dimensionality of the dataset. The data is often structured across several relatively independent dimensions, with each dimension measured using several attributes. This is where principal component analysis (PCA) is an essential tool, as it permits each observation to receive a score on each of the dimensions (determined by PCA itself) while allowing to discard the attributes from which the dimensions are computed. What is meant here is that for each of the obtained dimensions, values (scores) will be produced that combines several attributes. These can be used for further analysis. In the next section, we will use PCA to combine several attributes in the questionnaire data. We will see that the participants' self-report on items such as lively, excited, enthusiastic (and many more) can be combined in a single dimension we call positive arousal. In this sense, PCA performs both dimensionality reduction (discard the attributes) and feature extraction (compute the dimensions). The features can then be used for classification and regression problems. Another use of PCA is to check that the underlying structure of the data corresponds to a theoretical model. For instance, in a questionnaire, a group of questions might assess construct A, and another group construct B, and so on. PCA will find two different factors if indeed there is more similarity in the answers of participants within each group of questions compared to the overall questionnaire. Researchers in fields such as psychology use PCA mainly to test their theoretical model in this fashion. This article is a shortened version of the chapter with the same name in my book Learning predictive analytics with R. In what follows, we will examine how to use PCA in R. We will notably discover how to interpret the results and perform diagnostics. Learning PCA in R In this section, we will learn more about how to use PCA in order to obtain knowledge from data, and ultimately reduce the number of attributes (also called features). The first dataset we will use is the msg dataset from the psych package. The msq (for motivational state questionnaire) data set is composed of 92 attributes, of which 72 is the rating of adjectives by 3896 participants describing their mood. We will only use these 72 attributes for the current purpose, which is the exploration of the structure of the questionnaire. We will, therefore, start by installing and loading the package and the data, and assign the data we are interested in (the 75 attributes mentioned) to an object called motiv. install.packages("psych") library(psych) data(msq) motiv = msq[,1:72] Dealing with missing values Missing values are a common problem in real-life datasets like the one we use here. There are several ways to deal with them, but here we will only mention omitting the cases where missing values are encountered. Let's see how many missing values (we will call them NAs) there are for each attribute in this dataset: apply(is.na(motiv),2,sum) View of missing data in the dataset We can see that for many attributes this is unproblematic (there are only few missing values). However, in the case of several attributes, the number of NAs is quite high (anxious: 1849 NAs, cheerful: 1850, idle: 1848, inactive: 1846, tranquil: 1843). The most probable explanation is that these items have been dropped from some of the samples in which the data has been collected. Removing all cases with NAs from the dataset would result in an empty dataset in this particular case. Try the following in your console to verify this claim: na.omit(motiv) For this reason, we will simply deal with the problem by removing these attributes from the analysis. Remember that there are other ways, such as data imputation, to solve this issue while keeping the attributes. In order to do so, we first need to know which column number corresponds to the attributes we want to suppress. An easy way to do this is simply by printing the names of each columns vertically (using cbind() for something it was not exactly made for). We will omit cases with missing values on other attributes from the analysis later. head(cbind(names(motiv)),5) Here, we only print the result for the first five columns of the data frame: [,1] [1,] active [2,] afraid [3,] alert [4,] angry [5,] anxious We invite the reader to verify on his screen whether we suppress the correct columns from the analysis using the following vector to match the attributes: ToSuppress = c(5, 15, 37, 38, 66) Let's check whether it is correct: names(motiv[ToSuppress]) The following is the output: [1] "anxious" "cheerful" "idle" "inactive" "tranquil" Naming the components using the loadings We will run the analysis using the principal() function from the psych package. This will allow examining the loadings in a sorted fashion and making them independent from the others using a rotation. We will extract five components. The psych package has already been loaded, as the data we are analyzing comes from a dataset, which is included in psych. Yet, we need to install another package, the GPArotation package, which is required for the analysis we will perform next: install.packages("GPArotation") library(GPArotation) We can now run the analysis. This time we want to apply an orthogonal rotation varimax, in order to obtain independent factorial scores. Precisions are provided in the paper Principal component analysis by Abdi and Williams (2010). We also want that the analysis uses imputed data for missing values and estimates the scores for each observation on each retained principal component. We use the print.psych() function to print the sorted loadings, which will make the interpretation of the principal components easier: Pca2 = principal(motiv[,-ToSuppress],nfactors = 5, rotate = "varimax", missing = T, scores = T) print.psych(Pca2, sort =T) The annotated output displayed in the following screenshot has been slightly altered in order to allow it to fit on one page. Results of the PCA with the principal() function using varimax rotation The name of the items are displayed in the first column of the first matrix of results. The column item displays their order. The component loadings for the five retained components are displayed next (RC1 to RC5). We will comment on h2 and u2 later. The loadings of the attributes on the loadings can be used to name the principal components. As can be seen in the preceding screenshot, the first component could be called Positive arousal, the second component could be called Negative arousal, the third could be called Serenity, the fourth Exhaustion, and the last component could be called Fear. It is worth noting that the MSQ has theoretically four components obtained from two dimensions: energy and tension. Therefore, the fifth component we found is not accounted for by the model. The h2 value indicates the proportion of variance of the variable that is accounted for by the selected components. The u2 value indicates the part of variance not explained by the components. The sum of h2 and u2 is 1. Below the first result matrix, a second matrix indicates the proportion of variance explained by the components on the whole dataset. For instance, the first component explains 21 percent of the variance in the dataset Proportion Var and 38 percent of the variance explained by the five components. Overall, the five components explain 56 percent of the variance in the dataset Cumulative Var. As a remainder, the purpose of PCA is to replace all the original attributes by the scores on these components in further analysis. This is what we will discuss next. PCA scores At this point, you might wonder where data reduction comes into play. Well, the computation of the scores for each factor allows reducing the number of attributes for a dataset. This has been done in the previous section. Accessing the PCA scores We will now examine these scores more thoroughly. Let's start by checking whether the scores are indeed uncorrelated: round(cor(Pca2$scores),3) This is the output: RC1 RC2 RC3 RC4 RC5 RC1 1.000 -0.001 0.000 0.000 0.001 RC2 -0.001 1.000 0.000 0.000 0.000 RC3 0.000 0.000 1.000 -0.001 0.001 RC4 0.000 0.000 -0.001 1.000 0.000 RC5 0.001 0.000 0.001 0.000 1.000 As can be seen in the previous output, the correlations between the values are equal to or lower than 0.001 for every pair of components. The components are basically uncorrelated, as we requested. Let's start by checking that it indeed contains the same number of rows: nrow(Pca2$scores) == nrow(msq) Here's the output: [1] TRUE As the principal() function didn't remove any cases but imputed the data instead, we can now append the factor scores to the original dataset (a copy of it): bound = cbind(msq,Pca2$score) PCA diagnostics Here, we will briefly discuss two diagnostics that can be performed on the dataset that will be subjected to analysis. These diagnostics should be performed analyzing the data with PCA in order to ascertain that PCA is an optimal analysis for the considered data. The first diagnostic is Bartlett's test of sphericity. This test examines the relationship between the variables together, instead of 2 x 2 as in a correlation. To be more specific, it tests whether the correlation matrix is different from an identity matrix (which has ones on the diagonals and zeroes elsewhere). The null hypothesis is that the variables are independent (no underlying structure). The paper When is a correlation matrix appropriate for factor analysis? Some decision rules by Dzuiban and Shirkey (1974) provides more information on the topic. This test can be performed by the cortest.normal() function in the psych package. In this case, the first argument is the correlation matrix of the data we want to subject to PCA, and the n1 argument is the number of cases. We will use the same data set as in the PCA (we first assign the name M to it): M = na.omit(motiv[-ToSuppress]) cortest.normal(cor(M), n1 = nrow(M)) The output is provided as follows: Tests of correlation matrices Call:cortest.normal(R1 = cor(M), n1 = nrow(M)) Chi Square value 829268 with df = 2211 with probability < 0 The last line of the output shows that the data subjected to the analysis is clearly different from an identity matrix: The probability to obtain these results if it were an identity matrix is close to zero (do not pay attention to the > 0, it is simply extremely close to zero). You might be curious about what is the output of an identity matrix. It turns out we have something close to it: the correlations of the PCA scores with the varimax rotation that we examined before. Let's subject the scores to the analysis: cortest.normal(cor(Pca2$scores), n1 = nrow(Pca2$scores)) Tests of correlation matrices Call:cortest.normal(R1 = cor(Pca2$scores), n1 = nrow(Pca2$scores)) Chi Square value 0.01 with df = 10 with probability < 1 In this case, the results show that the correlation matrix is not significantly different from an identity matrix. The other diagnostic is the Kaiser Meyer Olkin (KMO) index, which indicates the part of the data that can be explained by elements present in the dataset. The higher this score the more the proportion of the data is explainable, for instance, by PCA. The KMO (also called MSA for Measure of Sample Adequacy) ranges from 0 (nothing is explainable) to 1 (everything is explainable). It can be returned for each item separately, or for the overall dataset. We will examine this value. It is the first component of the object returned by the KMO()function from psych package. The second component is the list of the values for the individual items (not examined here). It simply takes a matrix, a data frame, or a correlation matrix as an argument. Let's run in on our data: KMO(motiv)[1] This returns a value of 0.9715381, meaning that most of our data is explainable by the analysis. Summary In this article, we have discussed how the PCA algorithm works, how to select the appropriate number of components and how to use PCA scores for further analysis. Resources for Article: Further resources on this subject: Big Data Analysis (R and Hadoop) [article] How to do Machine Learning with Python [article] Clustering and Other Unsupervised Learning Methods [article]

0
0
10701

article-image-analyzing-textual-data-using-the-nltk-library

Sugandha Lahoti

24 Feb 2018

16 min read

Analyzing Textual Data using the NLTK Library

Sugandha Lahoti

24 Feb 2018

16 min read

[box type="note" align="" class="" width=""]This article is an excerpt from a book written by Armando Fandango titled Python Data Analysis - Second Edition. This book will help you learn to apply powerful data analysis techniques with popular open source Python modules. Code bundle for this article is hosted on GitHub.[/box] In this book excerpt, we will talk about various ways of performing text analytics using the NLTK Library. Natural Language Toolkit (NLTK) is one of the main libraries used for text analysis in Python. It comes with a collection of sample texts called corpora. Let's install the libraries required in this article with the following command: $ pip3 install nltk scikit-learn NLTK is a Python API for the analysis of texts written in natural languages, such as English. NLTK was created in 2001 and was originally intended as a teaching tool. Although we installed NLTK in the previous section, we are not done yet; we still need to download the NLTK corpora. The download is relatively large (about 1.8 GB); however, we only have to download it once. Unless you know exactly which corpora you require, it's best to download all the available corpora. Download the corpora from the Python shell as follows: $ python3 >>> import nltk >>> nltk.download() A GUI application should appear, where you can specify a destination and what file to download. If you are new to NLTK, it's most convenient to choose the default option and download everything. In this article, we will need the stopwords, movie reviews, names, and Gutenberg corpora. Readers are encouraged to follow the sections in the ch-09.ipynb file. Filtering out stopwords, names, and numbers Stopwords are common words that have very low information value in a text. It is a common practice in text analysis to get rid of stopwords. NLTK has a stopwords corpora for a number of languages. Load the English stopwords corpus and print some of the words: sw = set(nltk.corpus.stopwords.words('english')) print("Stop words:", list(sw)[:7]) The following common words are printed: Stop words: ['between', 'who', 'such', 'ourselves', 'an', 'ain', 'ours'] Note that all the words in this corpus are in lowercase. NLTK also has a Gutenberg corpus. The Gutenberg project is a digital library of books, mostly with expired copyright, which are available for free on the Internet (see http://www.gutenberg.org/). Load the Gutenberg corpus and print some of its filenames: gb = nltk.corpus.gutenberg print("Gutenberg files:n", gb.fileids()[-5:]) Some of the titles printed may be familiar to you: Gutenberg files: ['milton-paradise.txt', 'shakespeare-caesar.txt', 'shakespeare-hamlet.txt', 'shakespeare-macbeth.txt', 'whitman-leaves.txt'] Extract the first couple of sentences from the milton-paradise.txt file, which we will filter later: text_sent = gb.sents("milton-paradise.txt")[:2] print("Unfiltered:", text_sent) The following sentences are printed: Unfiltered [['[', 'Paradise', 'Lost', 'by', 'John', 'Milton', '1667', ']'], ['Book', 'I']] Now, filter out the stopwords as follows: for sent in text_sent: filtered = [w for w in sent if w.lower() not in sw] print("Filtered:n", filtered) For the first sentence, we get the following output: Filtered ['[', 'Paradise', 'Lost', 'John', 'Milton', '1667', ']'] If we compare this with the previous snippet, we notice that the word by has been filtered out as it was found in the stopwords corpus. Sometimes, we want to remove numbers and names too. We can remove words based on part of speech (POS) tags. In this tagging scheme, numbers correspond to the cardinal number (CD) tag. Names correspond to the proper noun singular (NNP) tag. Tagging is an inexact process based on heuristics. It's a big topic that deserves an entire book. Tag the filtered text with the pos_tag() function: tagged = nltk.pos_tag(filtered) print("Tagged:n", tagged) For our text, we get the following tags: Tagged [('[', 'NN'), ('Paradise', 'NNP'), ('Lost', 'NNP'), ('John', 'NNP'), ('Milton', 'NNP'), ('1667', 'CD'), (']', 'CD')] The pos_tag() function returns a list of tuples, where the second element in each tuple is the tag. As you can see, some of the words are tagged as NNP, although they probably shouldn't be. The heuristic here is to tag words as NNP if the first character of a word is uppercase. If we set all the words to be lowercase, we will get a different result. This is left as an exercise for the reader. It's easy to remove the words in the list with the NNP and CD tags, as described in the following code: words= [] for word in tagged: if word[1] != 'NNP' and word[1] != 'CD': words.append(word[0]) print(words) Have a look at the ch-09.ipynb file in the book’s code bundle: import nltk sw = set(nltk.corpus.stopwords.words('english')) print(“Stop words:", list(sw)[:7]) gb = nltk.corpus.gutenberg print(“Gutenberg files:n", gb.fileids()[-5:]) text_sent = gb.sents("milton-paradise.txt")[:2] print(“Unfiltered:", text_sent) for sent in text_sent: filtered = [w for w in sent if w.lower() not in sw] print("Filtered:n", filtered) tagged = nltk.pos_tag(filtered) print("Tagged:n", tagged) words= [] for word in tagged: if word[1] != 'NNP' and word[1] != 'CD': words.append(word[0]) print(“Words:n",words) The bag-of-words model In the bag-of-words model, we create from a document a bag containing words found in the document. In this model, we don't care about the word order. For each word in the document, we count the number of occurrences. With these word counts, we can do statistical analysis, for instance, to identify spam in e-mail messages. If we have a group of documents, we can view each unique word in the corpus as a feature; here, feature means parameter or variable. Using all the word counts, we can build a feature vector for each document; vector is used here in the mathematical sense. If a word is present in the corpus but not in the document, the value of this feature will be 0. Surprisingly, NLTK doesn't currently have a handy utility to create a feature vector. However, the machine learning Python library, scikit-learn, does have a CountVectorizer class that we can use. Load two text documents from the NLTK Gutenberg corpus: hamlet = gb.raw("shakespeare-hamlet.txt") macbeth = gb.raw("shakespeare-macbeth.txt") Create the feature vector by omitting English stopwords: cv = sk.feature_extraction.text.CountVectorizer(stop_words='english') print("Feature vector:n", cv.fit_transform([hamlet, macbeth]).toarray()) These are the feature vectors for the two documents: Feature vector: [[ 1 0 1 ..., 14 0 1] [ 0 1 0 ..., 1 1 0]] Print a small selection of the features (unique words) that we found: print("Features:n", cv.get_feature_names()[:5]) The features are given in alphabetical order: Features: ['1599', '1603', 'abhominably', 'abhorred', 'abide'] Have a look at the ch-09.ipynb file in this book’s code bundle: import nltk import sklearn as sk hamlet = gb.raw("shakespeare-hamlet.txt") macbeth = gb.raw("shakespeare-macbeth.txt") cv = sk.feature_extraction.text.CountVectorizer(stop_words='english') print(“Feature vector:n”, cv.fit_transform([hamlet, macbeth]).toarray()) print("Features:n", cv.get_feature_names()[:5]) Analyzing word frequencies The NLTK FreqDist class encapsulates a dictionary of words and counts for a given list of words. Load the Gutenberg text of Julius Caesar by William Shakespeare. Let's filter out the stopwords and punctuation: punctuation = set(string.punctuation) filtered = [w.lower() for w in words if w.lower() not in sw and w.lower() not in punctuation] Create a FreqDist object and print the associated keys and values with the highest frequency: fd = nltk.FreqDist(filtered) print("Words", fd.keys()[:5]) print("Counts", fd.values()[:5]) The keys and values are printed as follows: Words ['d', 'caesar', 'brutus', 'bru', 'haue'] Counts [215, 190, 161, 153, 148] The first word in this list is, of course, not an English word, so we may need to add the heuristic that words have a minimum of two characters. The NLTK FreqDist class allows dictionary-like access, but it also has convenience methods. Get the word with the highest frequency and related count: print("Max", fd.max()) print("Count", fd['d']) The following result shouldn't be a surprise: Max d Count 215 Up until this point, the analysis has focused on single words, but we can extend the analysis to word pairs and triplets. These are also called bigrams and trigrams. We can find them with the bigrams() and trigrams() functions. Repeat the analysis, but this time for bigrams: fd = nltk.FreqDist(nltk.bigrams(filtered)) print("Bigrams", fd.keys()[:5]) print("Counts", fd.values()[:5]) print("Bigram Max", fd.max()) print("Bigram count", fd[('let', 'vs')]) The following output should be printed: Bigrams [('let', 'vs'), ('wee', 'l'), ('mark', 'antony'), ('marke', 'antony'), ('st', 'thou')] Counts [16, 15, 13, 12, 12] Bigram Max ('let', 'vs') Bigram count 16 Have a peek at the ch-09.ipynb file in this book's code bundle: import nltk import string gb = nltk.corpus.gutenberg words = gb.words("shakespeare-caesar.txt") sw = set(nltk.corpus.stopwords.words('english')) punctuation = set(string.punctuation) filtered = [w.lower() for w in words if w.lower() not in sw and w.lower() not in punctuation] fd = nltk.FreqDist(filtered) print("Words", fd.keys()[:5]) print("Counts", fd.values()[:5]) print("Max", fd.max()) print("Count", fd['d']) fd = nltk.FreqDist(nltk.bigrams(filtered)) print("Bigrams", fd.keys()[:5]) print("Counts", fd.values()[:5]) print("Bigram Max", fd.max()) print("Bigram count", fd[('let', 'vs')]) Naive Bayes classification Classification algorithms are a type of machine learning algorithm that determine the class (category or type) of a given item. For instance, we could try to determine the genre of a movie based on some features. In this case, the genre is the class to be predicted. In this section, we will discuss a popular algorithm called Naive Bayes classification, which is frequently used to analyze text documents. Naive Bayes classification is a probabilistic algorithm based on the Bayes theorem from probability theory and statistics. The Bayes theorem formulates how to discount the probability of an event based on new evidence. For example, imagine that we have a bag with pieces of chocolate and other items we can't see. We will call the probability of drawing a piece of dark chocolate P(D). We will denote the probability of drawing a piece of chocolate as P(C). Of course, the total probability is always 1, so P(D) and P(C) can be at most 1. The Bayes theorem states that the posterior probability is proportional to the prior probability times likelihood: P(D|C) in the preceding notation means the probability of event D given C. When we haven't drawn any items, P(D) = 0.5 because we don't have any information yet. To actually apply the formula, we need to know P(C|D) and P(C), or we have to determine those indirectly. Naive Bayes classification is called naive because it makes the simplifying assumption of independence between features. In practice, the results are usually pretty good, so this assumption is often warranted to a certain level. Recently, it was found that there are theoretical reasons why the assumption makes sense. However, since machine learning is a rapidly evolving field, algorithms have been invented with (slightly) better performance. Let's try to classify words as stopwords or punctuation. As a feature, we will use the word length, since stopwords and punctuation tend to be short. This setup leads us to define the following functions: def word_features(word): return {'len': len(word)} def isStopword(word): return word in sw or word in punctuation Label the words in the Gutenberg shakespeare-caesar.txt based on whether or not they are stopwords: labeled_words = ([(word.lower(), isStopword(word.lower())) for word in words]) random.seed(42) random.shuffle(labeled_words) print(labeled_words[:5]) The 5 labeled words will appear as follows: [('was', True), ('greeke', False), ('cause', False), ('but', True), ('house', False)] For each word, determine its length: featuresets = [(word_features(n), word) for (n, word) in labeled_words] We will train a naive Bayes classifier on 90 percent of the words and test the remaining 10 percent. Create the train and the test set, and train the data: cutoff = int(.9 * len(featuresets)) train_set, test_set = featuresets[:cutoff], featuresets[cutoff:] classifier = nltk.NaiveBayesClassifier.train(train_set) We can now check how the classifier labels the words in the sets: classifier = nltk.NaiveBayesClassifier.train(train_set) print("'behold' class", classifier.classify(word_features('behold'))) print("'the' class", classifier.classify(word_features('the'))) Fortunately, the words are properly classified: 'behold' class False 'the' class True Determine the classifier accuracy on the test set as follows: print("Accuracy", nltk.classify.accuracy(classifier, test_set)) We get a high accuracy for this classifier of around 85 percent. Print an overview of the most informative features: print(classifier.show_most_informative_features(5)) The overview shows the word lengths that are most useful for the classification process: The code is in the ch-09.ipynb file in this book's code bundle: import nltk import string import random sw = set(nltk.corpus.stopwords.words('english')) punctuation = set(string.punctuation) def word_features(word): return {'len': len(word)} def isStopword(word): return word in sw or word in punctuation gb = nltk.corpus.gutenberg words = gb.words("shakespeare-caesar.txt") labeled_words = ([(word.lower(), isStopword(word.lower())) for word in words]) random.seed(42) random.shuffle(labeled_words) print(labeled_words[:5]) featuresets = [(word_features(n), word) for (n, word) in labeled_words] cutoff = int(.9 * len(featuresets)) train_set, test_set = featuresets[:cutoff], featuresets[cutoff:] classifier = nltk.NaiveBayesClassifier.train(train_set) print("'behold' class", classifier.classify(word_features('behold'))) print("'the' class", classifier.classify(word_features('the'))) print("Accuracy", nltk.classify.accuracy(classifier, test_set)) print(classifier.show_most_informative_features(5)) Sentiment analysis Opinion mining or sentiment analysis is a hot new research field dedicated to the automatic evaluation of opinions as expressed on social media, product review websites, or other forums. Often, we want to know whether an opinion is positive, neutral, or negative. This is, of course, a form of classification, as seen in the previous section. As such, we can apply any number of classification algorithms. Another approach is to semi-automatically (with some manual editing) compose a list of words with an associated numerical sentiment score (the word “good” can have a score of 5 and the word “bad” a score of -5). If we have such a list, we can look up all the words in a text document and, for example, sum up all the found sentiment scores. The number of classes can be more than three, as in a five-star rating scheme. We will apply naive Bayes classification to the NLTK movie reviews corpus with the goal of classifying movie reviews as either positive or negative. First, we will load the corpus and filter out stopwords and punctuation. These steps will be omitted, since we have performed them before. You may consider more elaborate filtering schemes, but keep in mind that excessive filtering may hurt accuracy. Label the movie reviews documents using the categories() method: labeled_docs = [(list(movie_reviews.words(fid)), cat) for cat in movie_reviews.categories() for fid in movie_reviews.fileids(cat)] The complete corpus has tens of thousands of unique words that we can use as features. However, using all these words might be inefficient. Select the top 5 percent of the most frequent words: words = FreqDist(filtered) N = int(.05 * len(words.keys())) word_features = words.keys()[:N] For each document, we can extract features using a number of methods, including the following: Check whether the given document has a word or not Determine the number of occurrences of a word for a given document Normalize word counts so that the maximum normalized word count will be less than or equal to 1 Take the logarithm of counts plus 1 (to avoid taking the logarithm of zero) Combine all the previous points into one metric As the saying goes, all roads lead to Rome. Of course, some roads are safer and will bring you to Rome faster. Define the following function, which uses raw word counts as a metric: def doc_features(doc): doc_words = FreqDist(w for w in doc if not isStopWord(w)) features = {} for word in word_features: features['count (%s)' % word] = (doc_words.get(word, 0)) return features We can now train our classifier just as we did in the previous example. An accuracy of 78 percent is reached, which is decent and comes close to what is possible with sentiment analysis. Research has found that even humans don't always agree on the sentiment of a given document (see http://mashable.com/2010/04/19/sentiment-analysis/), and therefore, we can't have a 100 percent perfect accuracy with sentiment analysis software. The most informative features are printed as follows: If we go through this list, we find obvious positive words such as “wonderful” and “outstanding”. The words “bad”, “stupid”, and “boring” are the obvious negative words. It would be interesting to analyze the remaining features. This is left as an exercise for the reader. Refer to the sentiment.py file in this book's code bundle: import random from nltk.corpus import movie_reviews from nltk.corpus import stopwords from nltk import FreqDist from nltk import NaiveBayesClassifier from nltk.classify import accuracy import string labeled_docs = [(list(movie_reviews.words(fid)), cat) for cat in movie_reviews.categories() for fid in movie_reviews.fileids(cat)] random.seed(42) random.shuffle(labeled_docs) review_words = movie_reviews.words() print("# Review Words", len(review_words)) sw = set(stopwords.words('english')) punctuation = set(string.punctuation) def isStopWord(word): return word in sw or word in punctuation filtered = [w.lower() for w in review_words if not isStopWord(w.lower())] print("# After filter", len(filtered)) words = FreqDist(filtered) N = int(.05 * len(words.keys())) word_features = words.keys()[:N] def doc_features(doc): doc_words = FreqDist(w for w in doc if not isStopWord(w)) features = {} for word in word_features: features['count (%s)' % word] = (doc_words.get(word, 0)) return features featuresets = [(doc_features(d), c) for (d,c) in labeled_docs] train_set, test_set = featuresets[200:], featuresets[:200] classifier = NaiveBayesClassifier.train(train_set) print("Accuracy", accuracy(classifier, test_set)) print(classifier.show_most_informative_features()) We covered textual analysis and learned that it's a best practice to get rid of stopwords. In the bag-of-words model, we used a document to create a bag containing words found in that same document. We learned how to build a feature vector for each document using all the word counts. Classification algorithms are a type of machine learning algorithm, which involve determining the class of a given item. Naive Bayes classification is a probabilistic algorithm based on the Bayes theorem from probability theory and statistics. The Bayes theorem states that the posterior probability is proportional to the prior probability multiplied by the likelihood. If you liked this post, check out the book Python Data Analysis - Second Edition to know more about analyzing other forms of textual data and social media analysis.

0
0
10646

article-image-monte-carlo-simulation-and-options

Packt

21 Apr 2014

10 min read

Monte Carlo Simulation and Options

Packt

21 Apr 2014

10 min read

0
0
10623

article-image-tech-regulation-heats-up-australias-abhorrent-violent-material-bill-to-warrens-corporate-executive-accountability-act

Fatema Patrawala

04 Apr 2019

6 min read

Tech regulation to an extent of sentence jail: Australia’s ‘Sharing of Abhorrent Violent Material Bill’ to Warren’s ‘Corporate Executive Accountability Act’

Fatema Patrawala

04 Apr 2019

6 min read

Businesses in powerful economies like USA, UK, Australia are as arguably powerful as politics or more than that. Especially now that we inhabit a global economy where an intricate web of connections can show the appalling employment conditions of Chinese workers who assemble the Apple smartphones we depend on. Amazon holds a revenue bigger than Kenya’s GDP. According to Business Insider, 25 major American corporations have revenues greater than the GDP of countries around the world. Because corporations create millions of jobs and control vast amounts of money and resources, their sheer economic power dwarfs government's ability to regulate and oversee them. With the recent global scale scandals that the tech industry has found itself in, with some resulting in deaths of groups of people, governments are waking up to the urgency for the need to hold tech companies responsible. While some government laws are reactionary, others are taking a more cautious approach. One thing is for sure, 2019 will see a lot of tech regulation come to play. How effective they are and what intended and unintended consequences they bear, how masterfully big tech wields its lobbying prowess, we’ll have to wait and see. Holding Tech platforms enabling hate and violence, accountable Australian govt passes law that criminalizes companies and execs for hosting abhorrent violent content Today, Australian parliament has passed legislation to crack down on violent videos on social media. The bill, described the attorney general, Christian Porter, as “most likely a world first”, was drafted in the wake of the Christchurch terrorist attack by a White supremacist Australian, when video of the perpetrator’s violent attack spread on social media faster than it could be removed. The Sharing of Abhorrent Violent Material bill creates new offences for content service providers and hosting services that fail to notify the Australian federal police about or fail to expeditiously remove videos depicting “abhorrent violent conduct”. That conduct is defined as videos depicting terrorist acts, murders, attempted murders, torture, rape or kidnap. The bill creates a regime for the eSafety Commissioner to notify social media companies that they are deemed to be aware they are hosting abhorrent violent material, triggering an obligation to take it down. While the Digital Industry Group which consists of Google, Facebook, Twitter, Amazon and Verizon Media in Australia has warned that the bill is passed without meaningful consultation and threatens penalties against content created by users. Sunita Bose, the group’s managing director says, “ with the vast volumes of content uploaded to the internet every second, this is a highly complex problem”. She further debates that “this pass it now, change it later approach to legislation creates immediate uncertainty to the Australia’s tech industry”. The Chief Executive of Atlassian Scott Farquhar said that the legislation fails to define how “expeditiously” violent material should be removed, and did not specify on who should be punished in the social media company. https://twitter.com/scottfarkas/status/1113391831784480768 The Law Council of Australia president, Arthur Moses, said criminalising social media companies and executives was a “serious step” and should not be legislated as a “knee-jerk reaction to a tragic event” because of the potential for unintended consequences. Contrasting Australia’s knee-jerk legislation, the US House Judiciary committee has organized a hearing on white nationalism and hate speech and their spread online. They have invited social media platform execs and civil rights organizations to participate. Holding companies accountable for reckless corporate behavior Facebook has undergone scandals after scandals with impunity in recent years given the lack of legislation in this space. Facebook has repeatedly come under the public scanner for data privacy breaches to disinformation campaigns and beyond. Adding to its ever-growing list of data scandals yesterday CNN Business uncovered hundreds of millions of Facebook records were stored on Amazon cloud servers in a way that it allowed to be downloaded by the public. Earlier this month on 8th March, Sen. Warren has proposed to build strong anti-trust laws and break big tech companies like Amazon, Google, Facebook and Apple. Yesterday, she introduced Corporate Executive Accountability Act and also reintroduced the “too big to fail” bill a new piece of legislation that would make it easier to criminally charge company executives when Americans’ personal data is breached, among other corporate negligent behaviors. “When a criminal on the street steals money from your wallet, they go to jail. When small-business owners cheat their customers, they go to jail,” Warren wrote in a Washington Post op-ed published on Wednesday morning. “But when corporate executives at big companies oversee huge frauds that hurt tens of thousands of people, they often get to walk away with multimillion-dollar payouts.” https://twitter.com/SenWarren/status/1113448794912382977 https://twitter.com/SenWarren/status/1113448583771185153 According to Elizabeth, just one banker went to jail after the 2008 financial crisis. The CEO of Wells Fargo and his successor walked away from the megabank with multimillion-dollar pay packages after it was discovered employees had created millions of fake accounts. The same goes for the Equifax CEO after its data breach. The new legislation Warren introduced would make it easier to hold corporate executives accountable for their companies’ wrongdoing. Typically, it’s been hard to prove a case against individual executives for turning a blind eye toward risky or questionable activity, because prosecutors have to prove intent — basically, that they meant to do it. This legislation would change that, Heather Slavkin Corzo, a senior fellow at the progressive nonprofit Americans for Financial Reform, said to the Vox reporter. “It’s easier to show a lack of due care than it is to show the mental state of the individual at the time the action was committed,” she said. A summary of the legislation released by Warren’s office explains that it would “expand criminal liability to negligent executives of corporations with over $1 billion annual revenue” who: Are found guilty, plead guilty, or enter into a deferred or non-prosecution agreement for any crime. Are found liable or enter a settlement with any state or Federal regulator for the violation of any civil law if that violation affects the health, safety, finances, or personal data of 1% of the American population or 1% of the population of any state. Are found liable or guilty of a second civil or criminal violation for a different activity while operating under a civil or criminal judgment of any court, a deferred prosecution or non prosecution agreement, or settlement with any state or Federal agency. Executives found guilty of these violations could get up to a year in jail. And a second violation could mean up to three years. The Corporate Executive Accountability Act is yet another push from Warren who has focused much of her presidential campaign on holding corporations and their leaders responsible for both their market dominance and perceived corruption. Elizabeth Warren wants to break up tech giants like Amazon, Google Facebook, and Apple and build strong antitrust laws Zuckerberg wants to set the agenda for tech regulation in yet another “digital gangster” move Facebook under criminal investigations for data sharing deals: NYT report

0
0
10615

article-image-troll-patrol-report-amnesty-international-and-element-ai-use-machine-learning-to-understand-online-abuse-against-women

Sugandha Lahoti

18 Dec 2018

5 min read

Troll Patrol Report: Amnesty International and Element AI use machine learning to understand online abuse against women

Sugandha Lahoti

18 Dec 2018

5 min read

Amnesty International has partnered with Element AI to release a Troll Patrol report on the online abuse against women on Twitter. This finding was a part of their Troll patrol project which invites human rights researchers, technical experts, and online volunteers to build a crowd-sourced dataset of online abuse against women. https://twitter.com/amnesty/status/1074946094633836544 Abuse of women on social media websites has been rising at an unprecedented rate. Social media websites have a responsibility to respect human rights and to ensure that women using the platform are able to express themselves freely and without fear. However, this has not been the case with Twitter and Amnesty has unearthed certain discoveries. Amnesty’s methodology was powered by machine learning Amnesty and Element AI surveyed 778 journalists and politicians from the UK and US throughout 2017 and then use machine learning techniques to qualitatively analyze abuse against women. The first process was to design large, unbiased dataset of tweets mentioning 778 women politicians and journalists from the UK and US. Next, over 6,500 volunteers (aged between 18 to 70 years old and from over 150 countries) analyzed 288,000 unique tweets to create a labeled dataset of abusive or problematic content. This was based on simple questions such as if the tweets were abusive or problematic, and if so, whether they revealed misogynistic, homophobic or racist abuse or other types of violence. Three experts also categorized a sample of 1,000 tweets to assess the quality of the tweets labeled by digital volunteers. Element AI used data science specifically using a subset of the Decoders and experts’ categorization of the tweets, to extrapolate the abuse analysis. Key findings from the report Per the findings of the Troll Patrol report, 7.1% of tweets sent to the women in the study were “problematic” or “abusive”. This amounts to 1.1 million tweets mentioning 778 women across the year, or one every 30 seconds. Women of color, (black, Asian, Latinx and mixed-race women) were 34% more likely to be mentioned in abusive or problematic tweets than white women. Black women were disproportionately targeted, being 84% more likely than white women to be mentioned in abusive or problematic tweets. Source: Amnesty Online abuse targets women from across the political spectrum faced similar levels of online abuse and both liberals and conservatives alike, as well as left and right-leaning media organizations, were targeted. Source: Amnesty What does this mean for people in tech Social media organizations are repeatedly failing in their responsibility to protect women’s rights online. They fall short of adequately investigating and responding to reports of violence and abuse in a transparent manner which leads many women to silence or censor themselves on the platform. Such abuses also hinder the freedom of expression online and also undermines women’s mobilization for equality and justice, particularly those groups who already face discrimination and marginalization. What can tech platforms do? One of the recommendations of the report is that social media platforms should publicly share comprehensive and meaningful information about reports of violence and abuse against women, as well as other groups, on their platforms. They should also talk in detail about how they are responding to it. Although Twitter and other platforms are using machine learning for content moderation and flagging, they should be transparent about the algorithms they use. They should publish information about training data, methodologies, moderation policies and technical trade-offs (such as between greater precision or recall) for public scrutiny. Machine learning automation should ideally be part of a larger content moderation system characterized by human judgment, greater transparency, rights of appeal and other safeguards. Amnesty in collaboration with Element AI also developed a machine learning model to better understand the potential and risks of using machine learning in content moderation systems. This model was able to achieve results comparable to their digital volunteers at predicting abuse, although it is ‘far from perfect still’, Amnesty notes. It achieves about a 50% accuracy level when compared to the judgment of experts. It was able to correctly identify 2 in every 14 tweets as abusive or problematic in comparison to experts who identified 1 in every 14 tweets as abusive or problematic. “Troll Patrol isn’t about policing Twitter or forcing it to remove content. We are asking it to be more transparent, and we hope that the findings from Troll Patrol will compel it to make that change. Crucially, Twitter must start being transparent about how exactly they are using machine learning to detect abuse, and publish technical information about the algorithms they rely on”. said Milena Marin senior advisor for tactical research at Amnesty International. Read more: The full list of Amnesty’s recommendations to Twitter. People on Twitter (the irony) are shocked at the release of Amnesty’s report and #ToxicTwitter is trending. https://twitter.com/gregorystorer/status/1074959864458178561 https://twitter.com/blimundaseyes/status/1074954027287396354 https://twitter.com/MikeWLink/status/1074500992266354688 https://twitter.com/BethRigby/status/1074949593438265344 Check out the full Troll Patrol report on Amnesty. Also, check out their machine learning based methodology in detail. Amnesty International takes on Google over Chinese censored search engine, Project Dragonfly. Twitter CEO, Jack Dorsey slammed by users after a photo of him holding ‘smash Brahminical patriarchy’ poster went viral Twitter plans to disable the ‘like’ button to promote healthy conversations; should retweet be removed instead?

0
0
10551

article-image-introduction-ggplot2-and-plotting-environments-r

Packt

25 Jun 2015

15 min read

Introduction to ggplot2 and the plotting environments in R

Packt

25 Jun 2015

15 min read

In this article by Donato Teutonico, author of the book ggplot2 Essentials, we are going to explore different plotting environments in R and subsequently learn about the package, ggplot2. R provides a complete series of options available for realizing graphics, which make this software quite advanced concerning data visualization. The core of the graphics visualization in R is within the package grDevices, which provides the basic structure of data plotting, as for instance the colors and fonts used in the plots. Such graphic engine was then used as starting point in the development of more advanced and sophisticated packages for data visualization; the most commonly used being graphics and grid. (For more resources related to this topic, see here.) The graphics package is often referred to as the base or traditional graphics environment, since historically it was already available among the default packages delivered with the base installation of R and it provides functions that allow to the generation of complete plots. The grid package developed by Paul Murrell, on the other side, provides an alternative set of graphics tools. This package does not provide directly functions that generate complete plots, so it is not frequently used directly for generating graphics, but it was used in the development of advanced data visualization packages. Among the grid-based packages, the most widely used are lattice and ggplot2, although they are built implementing different visualization approaches. In fact lattice was build implementing the Trellis plots, while ggplot2 was build implementing the grammar of graphics. A diagram representing the connections between the tools just mentioned is represented in the Figure 1. Figure 1: Overview of the most widely used R packages for graphics Just keep in mind that this is not a complete overview of the packages available, but simply a small snapshot on the main packages used for data visualization in R, since many other packages are built on top of the tools just mentioned. If you would like to get a more complete overview of the graphics tools available in R, you may have a look at the web page of the R project summarizing such tools, http://cran.r-project.org/web/views/Graphics.html. ggplot2 and the Grammar of Graphics The package ggplot2 was developed by Hadley Wickham by implementing a completely different approach to statistical plots. As in the case of lattice, this package is also based on grid, providing a series of high-level functions which allow the creation of complete plots. The ggplot2 package provides an interpretation and extension of the principles of the book The Grammar of Graphics by Leland Wilkinson. Briefly, the Grammar of Graphics assumes that a statistical graphic is a mapping of data to aesthetic attributes and geometric objects used to represent the data, like points, lines, bars, and so on. Together with the aesthetic attributes, the plot can also contain statistical transformation or grouping of the data. As in lattice, also in ggplot2 we have the possibility to split data by a certain variable obtaining a representation of each subset of data in an independent sub-plot; such representation in ggplot2 is called faceting. In a more formal way, the main components of the grammar of graphics are: the data and their mapping, the aesthetic, the geometric objects, the statistical transformations, scales, coordinates and faceting. A more detailed description of these elements is provided along the book ggplot2 Essentials, but this is a summary of the general principles The data that must be visualized are mapped to aesthetic attributes which define how the data should be perceived The geometric objects describe what is actually represented on the plot like lines, points, or bars; the geometric objects basically define which kind of plot you are going to draw The statistical transformations are transformations which are applied to the data to group them; an example of statistical transformations would be, for instance, the smooth line or the regression lines of the previous examples or the binning of the histograms. Scales represent the connection between the aesthetic spaces with the actual values which should be represented. Scales maybe also be used to draw legends The coordinates represent the coordinate system in which the data are drawn The faceting, which we have already mentioned, is a grouping of data in subsets defined by a value of one variable In ggplot2 there are two main high-level functions, capable of creating directly creating a plot, qplot() and ggplot(); qplot() stands for quick plot and it is a simple function with serve a similar purpose to the plot() function in graphics. The function ggplot() on the other side is a much more advanced function which allow the user to have a deep control of the plot layout and details. In this article we will see some examples of qplot() in order to provide you with a taste of the typical plots which can be realized with ggplot2, but for more advanced data visualization the function ggplot(), is much more flexible. If you have a look on the different forums of R programming, there is quite some discussion about which of these two functions would be more convenient to use. My general recommendation would be that it depends on the type of graph you are drawing more frequently. For simple and standard plot, where basically only the data should be represented and some minor modification of standard layout, the qplot() function will do the job. On the other side, if you would need to apply particular transformations to the data or simply if you would like to keep the freedom of controlling and defining the different details of the plot layout, I would recommend to focus in learning the code of ggplot(). In the code below you will see an example of plot realized with ggplot2 where you can identify some of the components of the grammar of graphics. The example is realized with the function ggplot() which allow a more direct comparison with the grammar, but just below you may also find the corresponding code for the use of qplot(). Both codes generate the graph depicted on Figure 2. require(ggplot2) ## Load ggplot2 data(Orange) # Load the data ggplot(data=Orange, ## Data used aes(x=circumference,y=age, color=Tree))+ ##mapping to aesthetic geom_point()+ ##Add geometry (plot with data points) stat_smooth(method="lm",se=FALSE) ##Add statistics(linear regression) ### Corresponding code with qplot() qplot(circumference,age,data=Orange, ## Data used color=Tree, ## Aestetic mapping geom=c("point","smooth"),method="lm",se=FALSE) This simple example can give you an idea of the role of each portion of code in a ggplot2 graph; you have seen how the main function body create the connection between the data and the aesthetic we are interested to represent and how, on top of this, you add the components of the plot like in this case the geometry element of points and the statistical element of regression. You can also notice how the components which need to be added to the main function call are included using the + sign. One more thing worth to mention at this point, is the if you run just the main body function in the ggplot() function, you will get an error message. This is because this call is not able to generate an actual plot. The step during which the plot is actually created is when you include the geometric attributes, in this case geom_point(). This is perfectly in line with the grammar of graphics, since as we have seen the geometry represent the actual connection between the data and what is represented on the plot. Is in fact at this stage that we specify we are interested in having points representing the data, before that nothing was specified yet about which plot we were interested in drawing. Figure 2: Example of plot of Orange dataset with ggplot2 The qplot() function The qplot (quick plot) function is a basic high level function of ggplot2. The general syntax that you should use with this function is the following qplot(x, y, data, colour, shape, size, facets, geom, stat) where x and y represent the variables to plot (y is optional with a default value NULL) data define the dataset containing the variables colour, shape and size are the aesthetic arguments that can be mapped on additional variables facets define the optional faceting of the plot based on one variable contained in the dataset geom allows you to select the actual visualization of the data, which basically will represent the plot which will be generated. Possible values are point, line or boxplot, but we will see several different examples in the next pages stat define the statistics to be used on the data These options represents the most important options available in qplot(). You may find a descriptions of the other function arguments in the help page of the function accessible with ?qplot, or on the ggplot2 website under the following link http://docs.ggplot2.org/0.9.3/qplot.html. Most of the options just discussed can be applied to different types of plots, since most of the concepts of the grammar of graphics, embedded in the code, may be translated from one plot to the other. For instance, you may use the argument colour to do an aesthetics mapping to one variable; these same concepts can in example be applied to scatterplots as well as histograms. Exactly the same principle would be applied to facets, which can be used for splitting plots independently on the type of plot considered. Histograms and density plots Histograms are plots used to explore how one (or more) quantitative variables are distributed. To show some examples of histograms we will use the iris data. This dataset contains measurements in centimetres of the variables sepal length and width and petal length and width for 50 flowers from each of three species of the flower iris: iris setosa, versicolor, and virginica. You may find more details running ?iris. The geometric attribute used to produce histograms is simply by specifying geom=”histogram” in the qplot() function. This default histogram will represent the variable specified on the x axis while the y axis will represent the number of elements in each bin. One other very useful way of representing distributions is to look at the kernel density function, which will basically produce a sort of continuous histogram instead of different bins by estimating the probability density function. For example let’s plot the petal length of all the three species of iris as histogram and density plot. data(iris) ## Load data qplot(Petal.Length, data=iris, geom="histogram") ## Histogram qplot(Petal.Length, data=iris, geom="density") ## Density plot The output of this code is showed in Figure 3. Figure 3: Histogram (left) and density plot (right) As you can see in both plots of Figure 3, it appears that the data are not distributed homogenously, but there are at least two distinct distribution clearly separated. This is very reasonably due to a different distribution for one of the iris species. To try to verify if the two distributions are indeed related to specie differences, we could generate the same plot using aesthetic attributes and have a different colour for each subtype of iris. To do this, we can simply map the fill to the Species column in the dataset; also in this case we can do that for the histogram and the density plot too. Below you may see the code we built, and in Figure 4 the resulting output. qplot(Petal.Length, data=iris, geom="histogram", colour=Species, fill=Species) qplot(Petal.Length, data=iris, geom="density", colour=Species, fill=Species) Figure 4: Histogram (left) and density plot (right) with aesthetic attribute for colour and fill In the distribution we can see that the lower data are coming from the Setosa species, while the two other distributions are partly overlapping. Scatterplots Scatterplots are probably the most common plot, since they are usually used to display the relationship between two quantitative variables. When two variables are provided, ggplot2 will make a scatterplot by default. For our example on how to build a scatterplot, we will use a dataset called ToothGrowth, which is available in the base R installation. In this dataset are reported measurements of teeth length of 10 guinea pig for three different doses of vitamin C (0.5, 1, and 2 mg) delivered in two different ways, as orange juice or as ascorbic acid (a compound having vitamin C activity). You can find, as usual, details on these data on the dataset help page at ?ToothGrowth. We are interested in seeing how the length of the teeth changed for the different doses. We are not able to distinguish among the different guinea pigs, since this information is not contained in the data, so for the moment we will plot just all the data we have. So let’s load the dataset and do a basic plot of the dose vs. length. require(ggplot2) data(ToothGrowth) qplot(dose, len, data=ToothGrowth, geom="point") ##Alternative coding qplot(dose, len, data=ToothGrowth) The resulting plot is reproduced in Figure 5. As you have seen, the default plot generated, also without a geom argument, is the scatter plot, which is the default bivariate plot type. In this plot we may have an idea of the tendency the data have, for instance we see that the teeth length increase by increasing the amount of vitamin C intake. On the other side, we know that there are two different subgroups in our data, since the vitamin C was provided in two different ways, as orange juice or as ascorbic acid, so it could be interesting to check if these two groups behave differently. Figure 5: Scatterplot of length vs. dose of ToothGrowth data The first approach could be to have the data in two different colours. To do that we simply need to assign the colour attribute to the column sup in the data, which defines the way of vitamin intake. The resulting plot is in Figure 6. qplot(dose, len,data=ToothGrowth, geom="point", col=supp) We now can distinguish from which intake route come each data in the plot and it looks like the data from orange juice shown are a little higher compared to ascorbic acid, but to differentiate between them it is not really easy. We could then try with the facets, so that the data will be completely separated in two different sub-plots. So let´s see what happens. Figure 6: Scatterplot of length vs. dose of ToothGrowth with data in different colours depending on vitamin intake. qplot(dose, len,data=ToothGrowth, geom="point", facets=.~supp) In this new plot, showed in Figure 7, we definitely have a better picture of the data, since we can see how the tooth growth differs for the different intakes. As you have seen, in this simple example, you will find that the best visualization may be different depending on the data you have. In some cases grouping a variable with colours or dividing the data with faceting may give you a different idea about the data and their tendency. For instance in our case with the plot in Figure 7 we can see that the way how the tooth growth increase with dose seems to be different for the different intake routes. Figure 7: Scatterplot of length vs. dose of ToothGrowth with faceting One approach to see the general tendency of the data could be to include a smooth line to the graph. In this case in fact we can see that the growth in the case of the orange juice does not looks really linear, so a smooth line could be a nice way to catch this. In order to do that we simply add a smooth curve to the vector of geometry components in the qplot() function. qplot(dose, len,data=ToothGrowth, geom=c("point","smooth"), facets=.~supp) As you can see from the plot obtained (Figure 8) we now see not only clearly the different data thanks to the faceting, but we can also see the tendency of the data with respect to the dose administered. As you have seen, requiring for the smooth line in ggplot2 will also include a confidence interval in the plot. If you would like to not to have the confidence interval you may simply add the argument se=FALSE. Figure 8: Scatterplot of length vs. dose of ToothGrowth with faceting and smooth line Summary In this short article we have seen some basic concept of ggplot2, ranging from the basic principles in comparison with the other R packages for graphics, up to some basic plots as for instance histograms, density plots or scatterplots. In this case we have limited our example to the use of qplot(), which enable us to obtain plots with some easy commands, but on the other side, in order to have a full control of plot appearance as well as data representation the function ggplot() will provide you with much more advanced functionalities. You can find a more detailed description of these functions as well as of the different features of ggplot2 together illustrated in various examples in the book ggplot2 Essentials. Resources for Article: Further resources on this subject: Data Analysis Using R [article] Data visualization [article] Using R for Statistics, Research, and Graphics [article]

0
0
10526

article-image-understanding-elf-specimen

Packt

07 Jan 2016

21 min read

Understanding the ELF specimen

Packt

07 Jan 2016

21 min read

In this article by Ryan O'Neill, author of the book Learning Linux Binary Analysis, ELF will be discussed. In order to reverse-engineer Linux binaries, we must understand the binary format itself. ELF has become the standard binary format for UNIX and UNIX-Flavor OS's. Binary formats such as ELF are not generally a quick study, and to learn ELF requires some degree of application of the different components that you learn as you go. Programming things that perform tasks such as binary parsing will require learning some ELF, and in the act of programming such things, you will in-turn learn ELF better and more proficiently as you go along. ELF is often thought to be a dry and complicated topic to learn, and if one were to simply read through the ELF specs without applying them through the spirit of creativity, then indeed it would be. ELF is really quite an incredible composition of computer science at work, with program layout, program loading, dynamic linking, symbol table lookups, and many other tightly orchestrated components. (For more resources related to this topic, see here.) ELF section headers Now that we've looked at what program headers are, it is time to look at section headers. I really want to point out here the distinction between the two; I often hear people calling sections "segments" and vice versa. A section is not a segment. Segments are necessary for program execution, and within segments are contained different types of code and data which are separated within sections and these sections always exist, and usually they are addressable through something called section-headers. Section-headers are what make sections accessible, but if the section-headers are stripped (Missing from the binary), it doesn't mean that the sections are not there. Sections are just data or code. This data or code is organized across the binary in different sections. The sections themselves exist within the boundaries of the text and data segment. Each section contains either code or data of some type. The data could range from program data, such as global variables, or dynamic linking information that is necessary for the linker. Now, as mentioned earlier, every ELF object has sections, but not all ELF objects have section headers. Usually this is because the executable has been tampered with (Such as the section headers having been stripped so that debugging is harder). All of GNU's binutils, such as objcopy, objdump, and other tools such as gdb, rely on the section-headers to locate symbol information that is stored in the sections specific to containing symbol data. Without section-headers, tools such as gdb and objdump are nearly useless. Section-headers are convenient to have for granular inspection over what parts or sections of an ELF object we are viewing. In fact, section-headers make reverse engineering a lot easier, since they provide us with the ability to use certain tools that require them. If, for instance, the section-header table is stripped, then we can't access a section such as .dynsym, which contains imported/exported symbols describing function names and offsets/addresses. Even if a section-header table has been stripped from an executable, a moderate reverse engineer can actually reconstruct a section-header table (and even part of a symbol table) by getting information from certain program headers, since these will always exist in a program or shared library. We discussed the dynamic segment earlier and the different DT_TAGs that contain information about the symbol table and relocation entries. This is what a 32 bit ELF section-header looks like: typedef struct { uint32_t sh_name; // offset into shdr string table for shdr name uint32_t sh_type; // shdr type I.E SHT_PROGBITS uint32_t sh_flags; // shdr flags I.E SHT_WRITE|SHT_ALLOC Elf32_Addr sh_addr; // address of where section begins Elf32_Off sh_offset; // offset of shdr from beginning of file uint32_t sh_size; // size that section takes up on disk uint32_t sh_link; // points to another section uint32_t sh_info; // interpretation depends on section type uint32_t sh_addralign; // alignment for address of section uint32_t sh_entsize; // size of each certain entries that may be in section } Elf32_Shdr; Let's take a look at some of the most important section types, once again allowing room to study the ELF(5) man pages and the official ELF specification for more detailed information about the sections. .text The .text section is a code section that contains program code instructions. In an executable program where there are also phdr, this section would be within the range of the text segment. Because it contains program code, it is of the section type SHT_PROGBITS. .rodata The rodata section contains read-only data, such as strings, from a line of C code: printf("Hello World!n"); These are stored in this section. This section is read-only, and therefore must exist in a read-only segment of an executable. So, you would find .rodata within the range of the text segment (Not the data segment). Because this section is read-only, it is of the type SHT_PROGBITS. .plt The Procedure linkage table (PLT) contains code that is necessary for the dynamic linker to call functions that are imported from shared libraries. It resides in the text segment and contains code, so it is marked as type SHT_PROGBITS. .data The data section, not to be confused with the data segment, will exist within the data segment and contain data such as initialized global variables. It contains program variable data, so it is marked as SHT_PROGBITS. .bss The bss section contains uninitialized global data as part of the data segment, and therefore takes up no space on the disk other than 4 bytes, which represents the section itself. The data is initialized to zero at program-load time, and the data can be assigned values during program execution. The bss section is marked as SHT_NOBITS, since it contains no actual data. .got The Global offset table (GOT) section contains the global offset table. This works together with the PLT to provide access to imported shared library functions, and is modified by the dynamic linker at runtime. This section has to do with program execution and is therefore marked as SHT_PROGBITS. .dynsym The dynsym section contains dynamic symbol information imported from shared libraries. It is contained within the text segment and is marked as type SHT_DYNSYM. .dynstr The dynstr section contains the string table for dynamic symbols; this has the name of each symbol in a series of null terminated strings. .rel.* Relocation sections contain information about how the parts of an ELF object or process image need to be fixed up or modified at linking or runtime. .hash The hash section, sometimes called .gnu.hash, contains a hash table for symbol lookup. The following hash algorithm is used for symbol name lookups in Linux ELF: uint32_t dl_new_hash (const char *s) { uint32_t h = 5381; for (unsigned char c = *s; c != ' '; c = *++s) h = h * 33 + c; return h; } .symtab The symtab section contains symbol information of the type ElfN_Sym. The symtab section is marked as type SHT_SYMTAB as it contains symbol information. .strtab This section contains the symbol string table that is referenced by the st_name entries within the ElfN_Sym structs of .symtab, and is marked as type SHT_STRTAB since it contains a string table. .shstrtab The shstrtab section contains the section header string table, which is a set of null terminated strings containing the names of each section, such as .text, .data, and so on. This section is pointed to by the ELF file header entry called e_shstrndx, which holds the offset of .shstrtab. This section is marked as SHT_STRTAB since it contains a string table. .ctors and .dtors The .ctors (constructors) and .dtors (destructors) sections contain code for initialization and finalization, which is to be executed before and after the actual main() body of program code, and then after the main program code. The __constructor__ function attribute is often used by hackers and virus writers to implement a function that performs an anti-debugging trick, such as calling PTRACE_TRACEME, so that the process traces itself and no debuggers can attach themselves to it. This way, the anti-debugging mechanism gets executed before the program enters main(). There are many other section names and types, but we have covered most of the primary ones found in a dynamically linked executable. One can now visualize how an executable is laid out with both phdrs and shdrs: ELF Relocations From the ELF(5) man pages: Relocation is the process of connecting symbolic references with symbolic definitions. Relocatable files must have information that describes how to modify their section contents, thus allowing executable and shared object files to hold the right information for a process's program image. Relocation entries are these data. The process of relocation relies on symbols, which is why we covered symbols first. An example of relocation might be a couple of relocatable objects (ET_REL) being linked together to create an executable. obj1.o wants to call a function, foo(), located in obj2.o. Both obj1.o and obj2.o are being linked to create a fully working executable; they are currently Position independent code (PIC), but once relocated to form an executable, they will no longer be position independent since symbolic references will be resolved into symbolic definitions. The term "relocated" means exactly that: a piece of code or data is being relocated from a simple offset in an object file to some memory address location in an executable, and anything that references that relocated code or data must also be adjusted. Let's take a quick look at a 32 bit relocation entry: typedef struct { Elf32_Addr r_offset; uint32_t r_info; } Elf32_Rel; And some relocation entries require an addend: typedef struct { Elf32_Addr r_offset; uint32_t r_info; int32_t r_addend; } Elf32_Rela; Following is the description of the preceding snippet: r_offset: This points to the location (offset or address) that requires the relocation action (which is going to be some type of modification) r_info: This gives both the symbol table index with respect to which the relocation must be made, and the type of relocation to apply r_addend: This specifies a constant addend used to compute the value stored in the relocatable field Let's take a look at the source code for obj1.o: _start() { foo(); } We see that it calls the function foo(), however foo() is not located within the source code or the compiled object file, so there will be a relocation entry necessary for symbolic reference: ryan@alchemy:~$ objdump -d obj1.o obj1.o: file format elf32-i386 Disassembly of section .text: 00000000 <func>: 0: 55 push %ebp 1: 89 e5 mov %esp,%ebp 3: 83 ec 08 sub $0x8,%esp 6: e8 fc ff ff ff call 7 <func+0x7> b: c9 leave c: c3 ret As we can see, the call to foo() is highlighted and simply calls to nowhere; 7 is the offset of itself. So, when obj1.o, which calls foo() (located in obj2.o), is linked with obj2.o to make an executable, a relocation entry is there to point at offset 7, which is the data that needs to be modified, changing it to the offset of the actual function, foo(), once the linker knows its location in the executable during link time: ryan@alchemy:~$ readelf -r obj1.o Relocation section '.rel.text' at offset 0x394 contains 1 entries: Offset Info Type Sym.Value Sym. Name 00000007 00000902 R_386_PC32 00000000 foo As we can see, a relocation field at offset 7 is specified by the relocation entry's r_offset field. R_386_PC32 is the relocation type; to understand all of these types, read the ELF specs as we will only be covering some. Each relocation type requires a different computation on the relocation target being modified. R_386_PC32 says to modify the target with S + A – P. The following list explains all these terms: S is the value of the symbol whose index resides in the relocation entry A is the addend found in the relocation entry P is the place (section offset or address) where the storage unit is being relocated (computed using r_offset) If we look at the final output of our executable after compiling obj1.o and obj2.o, as shown in the following code snippet: ryan@alchemy:~$ gcc -nostdlib obj1.o obj2.o -o relocated ryan@alchemy:~$ objdump -d relocated test: file format elf32-i386 Disassembly of section .text: 080480d8 <func>: 80480d8: 55 push %ebp 80480d9: 89 e5 mov %esp,%ebp 80480db: 83 ec 08 sub $0x8,%esp 80480de: e8 05 00 00 00 call 80480e8 <foo> 80480e3: c9 leave 80480e4: c3 ret 80480e5: 90 nop 80480e6: 90 nop 80480e7: 90 nop 080480e8 <foo>: 80480e8: 55 push %ebp 80480e9: 89 e5 mov %esp,%ebp 80480eb: 5d pop %ebp 80480ec: c3 ret We can see that the call instruction (the relocation target) at 0x80480de has been modified with the 32 bit offset value of 5, which points to foo(). The value 5 is the result of the R386_PC_32 relocation action: S + A – P: 0x80480e8 + 0xfffffffc – 0x80480df = 5 0xfffffffc is the same as -4 if a signed integer, so the calculation can also be seen as: 0x80480e8 + (0x80480df + sizeof(uint32_t)) To calculate an offset into a virtual address, use the following computation: address_of_call + offset + 5 (Where 5 is the length of the call instruction) Which in this case is 0x80480de + 5 + 5 = 0x80480e8. An address may also be computed into an offset with the following computation: address – address_of_call – 4 (Where 4 is the length of a call instruction – 1) Relocatable code injection based binary patching Relocatable code injection is a technique that hackers, virus writers, or anyone who wants to modify the code in a binary may utilize as a way to sort of re-link a binary after it has already been compiled. That is, you can inject an object file into an executable, update the executables symbol table, and perform the necessary relocations on the injected object code so that it becomes a part of the executable. A complicated virus might use this rather than just appending code at the end of an executable or finding existing padding. This technique requires extending the text segment to create enough padding room to load the object file. The real trick though is handling the relocations and applying them properly. I designed a custom reverse engineering tool for ELF that is named Quenya. Quenya has many features and capabilities, and one of them is to inject object code into an executable. Why do this? Well, one reason would be to inject a malicious function into an executable, and then hijack a legitimate function and replace it with the malicious one. From a security point of view, one could do hot-patching and apply a legitimate patch to a binary rather than doing something malicious. Let's pretend we are an attacker and we want to infect a program that calls puts() to print "Hello World", and our goal is to hijack puts() so that it calls evil_puts(). First, we would need to write a quick PIC object that can write a string to standard output: #include <sys/syscall.h> int _write (int fd, void *buf, int count) { long ret; __asm__ __volatile__ ("pushl %%ebxnt" "movl %%esi,%%ebxnt" "int $0x80nt" "popl %%ebx":"=a" (ret) :"0" (SYS_write), "S" ((long) fd), "c" ((long) buf), "d" ((long) count)); if (ret >= 0) { return (int) ret; } return -1; } int evil_puts(void) { _write(1, "HAHA puts() has been hijacked!n", 31); } Now, we compile evil_puts.c into evil_puts.o, and inject it into our program, hello_world: ryan@alchemy:~/quenya$ ./hello_world Hello World This program calls the following: puts(“Hello Worldn”); We now use Quenya to inject and relocate our evil_puts.o file into hello_world: [Quenya v0.1@alchemy] reloc evil_puts.o hello_world 0x08048624 addr: 0x8048612 0x080485c4 _write addr: 0x804861e 0x080485c4 addr: 0x804868f 0x080485c4 addr: 0x80486b7 Injection/Relocation succeeded As we can see, the function write() from our evil_puts.o has been relocated and assigned an address at 0x804861e in the executable, hello_world. The next command, hijack, overwrites the global offset table entry for puts() with the address of evil_puts(): [Quenya v0.1@alchemy] hijack binary hello_world evil_puts puts Attempting to hijack function: puts Modifying GOT entry for puts Succesfully hijacked function: puts Commiting changes into executable file [Quenya v0.1@alchemy] quit And Whammi! ryan@alchemy:~/quenya$ ./hello_world HAHA puts() has been hijacked! We have successfully relocated an object file into an executable and modified the executable's control flow so that it executes the code that we injected. If we use readelf -s on hello_world, we can actually now see a symbol called evil_puts(). For the readers interest, I have included a small snippet of code that contains the ELF relocation mechanics in Quenya; it may be a little bit obscure without knowledge of the rest of the code base, but it is also somewhat straightforward if you've paid attention to what we learned about relocations. It is just a snippet and does not show any of the other important aspects such as modifying the executables symbol table: case SHT_RELA: for (j = 0; j < obj.shdr[i].sh_size / sizeof(Elf32_Rela); j++, rela++) { rela = (Elf32_Rela *)(obj.mem + obj.shdr[i].sh_offset); /* symbol table */ symtab = (Elf32_Sym *)obj.section[obj.shdr[i].sh_link]; /* symbol we are applying relocation to */ symbol = &symtab[ELF32_R_SYM(rela->r_info)]; /* section to modify */ TargetSection = &obj.shdr[obj.shdr[i].sh_info]; TargetIndex = obj.shdr[i].sh_info; /* target location */ TargetAddr = TargetSection->sh_addr + rela->r_offset; /* pointer to relocation target */ RelocPtr = (Elf32_Addr *)(obj.section[TargetIndex] + rela->r_offset); /* relocation value */ RelVal = symbol->st_value; RelVal += obj.shdr[symbol->st_shndx].sh_addr; switch (ELF32_R_TYPE(rela->r_info)) { /* R_386_PC32 2 word32 S + A - P */ case R_386_PC32: *RelocPtr += RelVal; *RelocPtr += rela->r_addend; *RelocPtr -= TargetAddr; break; /* R_386_32 1 word32 S + A */ case R_386_32: *RelocPtr += RelVal; *RelocPtr += rela->r_addend; break; } } As shown in the preceding code, the relocation target that RelocPtr points to is modified according to the relocation action requested by the relocation type (such as R_386_32). Although relocatable code binary injection is a good example of the idea behind relocations, it is not a perfect example of how a linker actually performs it with multiple object files. Nevertheless, it still retains the general idea and application of a relocation action. Later on, we will talk about shared library (ET_DYN) injection, which brings us now to the topic of dynamic linking. Summary In this article we discussed different types of ELF section headers and ELF relocations. Resources for Article: Further resources on this subject: Advanced User Management [article] Creating Code Snippets [article] Advanced Wireless Sniffing [article]

0
0
10445

article-image-querying-and-filtering-data

Packt

25 Jun 2015

28 min read

Querying and Filtering Data

Packt

25 Jun 2015

28 min read

0
0
10430

article-image-create-treemap-packed-bubble-chart-tableau

Sugandha Lahoti

02 Jan 2018

4 min read

How to create a Treemap and Packed Bubble Chart in Tableau

Sugandha Lahoti

02 Jan 2018

4 min read

[box type="note" align="" class="" width=""]This article is an excerpt from a book written by Shweta Sankhe-Savale titled Tableau Cookbook – Recipes for Data Visualization. This cookbook has simple recipes for creating visualizations in Tableau. It covers the fundamentals of data visualization such as getting familiarized with Tableau Desktop and also goes to more complex problems like creating dynamic analytics with parameters, and advanced calculations.[/box] In today’s tutorial, we will learn how to create a Treemap and a packed Bubble chart in Tableau. Treemaps Treemaps are useful for representing hierarchical (tree-structured) data as a part-to-whole relationship. It shows data as a set of nested rectangles, and each branch of the tree is given a rectangle, which represents the amount of data it comprises. These can then be further divided into smaller rectangles that represent sub branches, based on its proportion to the whole. We can show information via the color and size of the rectangles and find out patterns that would be difficult to spot in other ways. They make efficient use of the space and hence can display a lot of items in a single visualization simultaneously. Getting Ready We will create a Treemap to show the sales and profit across various product subcategories. Let's see how to create a Treemap. How to Do it We will first create a new sheet and rename it as Treemap. Next, we will drag Sales from the Measures pane and drop it into the Size shelf. We will then drag Profit from Measures pane and drop it into the Color shelf. Our Mark type will automatically change to show squares. Refer to the following image: 5. Next, we will drop Sub-Category into the Label shelf in the Marks card, and we will get the output as shown in the following image: How it Works In the preceding image, since we have placed Sales in the Size shelf, we are inferring this: the greater the size, the higher the sales value; the smaller the size, the smaller the sales value. Since the Treemap is sorted in descending order of Size, we will see the biggest block in the top left-hand side corner and the smaller block in the bottom right-hand side corner. Further, we placed Profit in the Color shelf. There are some subcategories where the profit is negative and hence Tableau selects the orange/blue diverging color. Thus, when the color blue is the darkest, it indicates the Most profit. However, the orange color indicates that a particular subcategory is in a loss scenario. So, in the preceding chart, Phones has the maximum number of sales. Further, Copiers has the highest profit. Tables, on the other hand, is non-profitable. Packed Bubble Charts A Packed bubble chart is a cluster of circles where we use dimensions to define individual bubbles, and the size and/or color of the individual circles represent measures. Bubble charts have many benefits and one of them is to let us spot categories easily and compare them to the rest of the data by looking at the size of the bubble. This simple data visualization technique can provide insight in a visually attractive format. The Packed Bubble chart in Tableau uses the Circle mark type. Getting Ready To create a packed bubble chart, we will continue with the same example that we saw in the Treemap recipe. In the following section, we will see how we can convert the Treemap we created earlier into a Packed Bubble chart. How to Do it Let us duplicate the Tree Map sheet name and rename it to Packed Bubble chart. Next, change the marks from Square to Circle from the Marks dropdown in the Marks card. The output will be as shown in the following image: How it works In the Packed Bubble chart, there is no specific sort of order for Bubbles. The size and/or color are what defines the chart; the bigger or darker the circle, the greater the value. So, in the preceding example, we have Sales in the Size shelf, Profit in the Color shelf, and Sub-Category in the Label shelf. Thus, when we look at it, we understand that Phones has the most sales. Further, Copiers has the highest profit. Tables, on the other hand, is non-profitable even though the size indicates that the sales are fairly good. We saw two ways to visualize data by using Treemap and Packed Bubble chart types in Tableau. If you found this post is useful, do check out the book Tableau Cookbook – Recipes for Data Visualization to create more such charts, interactive dashboards and other beautiful data visualizations with Tableau.

0
0
10426

How-To Tutorials - Data

Highlights from Mary Meeker’s 2019 Internet trends report

Big Data

Rotation Forest - A Classifier Ensemble Based on Feature Extraction

The plot function

Regression models in Weka

Entity Framework Code-First: Accessing Database Views and Stored Procedures

Dimensionality Reduction with Principal Component Analysis

Analyzing Textual Data using the NLTK Library

Monte Carlo Simulation and Options

Tech regulation to an extent of sentence jail: Australia’s ‘Sharing of Abhorrent Violent Material Bill’ to Warren’s ‘Corporate Executive Accountability Act’

Trending Topics

Troll Patrol Report: Amnesty International and Element AI use machine learning to understand online abuse against women

Introduction to ggplot2 and the plotting environments in R

Understanding the ELF specimen

Querying and Filtering Data

How to create a Treemap and Packed Bubble Chart in Tableau

Create a Free Account To Continue Reading

Sign in to activate your 7-day free access