
How-To Tutorials - Data

1210 Articles

Perform CRUD operations on MongoDB with PHP

Amey Varangaonkar
17 Mar 2018
6 min read
Note: This article is an excerpt from the book Mastering MongoDB 3.x by Alex Giamas. The book covers the key concepts, tips, and tricks needed to build fault-tolerant applications with MongoDB, and will help you become a true expert in the world's most popular NoSQL database.

In this tutorial, we cover the CRUD (Create, Read, Update, and Delete) operations in PHP using the official MongoDB driver.

Create and delete operations

To perform a create operation, run the following code:

```php
$document = array( "isbn" => "401", "name" => "MongoDB and PHP" );
$result = $collection->insertOne($document);
var_dump($result);
```

This is the output:

```
MongoDB\InsertOneResult Object
(
    [writeResult:MongoDB\InsertOneResult:private] => MongoDB\Driver\WriteResult Object
        (
            [nInserted] => 1
            [nMatched] => 0
            [nModified] => 0
            [nRemoved] => 0
            [nUpserted] => 0
            [upsertedIds] => Array ( )
            [writeErrors] => Array ( )
            [writeConcernError] =>
            [writeConcern] => MongoDB\Driver\WriteConcern Object ( )
        )
    [insertedId:MongoDB\InsertOneResult:private] => MongoDB\BSON\ObjectID Object
        (
            [oid] => 5941ac50aabac9d16f6da142
        )
    [isAcknowledged:MongoDB\InsertOneResult:private] => 1
)
```

The rather lengthy output contains all the information we may need: the ObjectId of the inserted document; the number of inserted, matched, modified, removed, and upserted documents in the fields prefixed with n; and information about writeError or writeConcernError. The $result object also offers convenience methods:

- $result->getInsertedCount(): Get the number of inserted objects
- $result->getInsertedId(): Get the ObjectId of the inserted document

We can also use the ->insertMany() method to insert many documents at once, like this:

```php
$documentAlpha = array( "isbn" => "402", "name" => "MongoDB and PHP, 2nd Edition" );
$documentBeta  = array( "isbn" => "403", "name" => "MongoDB and PHP, revisited" );
$result = $collection->insertMany([$documentAlpha, $documentBeta]);
print_r($result);
```

The result is:

```
MongoDB\InsertManyResult Object
(
    [writeResult:MongoDB\InsertManyResult:private] => MongoDB\Driver\WriteResult Object
        (
            [nInserted] => 2
            [nMatched] => 0
            [nModified] => 0
            [nRemoved] => 0
            [nUpserted] => 0
            [upsertedIds] => Array ( )
            [writeErrors] => Array ( )
            [writeConcernError] =>
            [writeConcern] => MongoDB\Driver\WriteConcern Object ( )
        )
    [insertedIds:MongoDB\InsertManyResult:private] => Array
        (
            [0] => MongoDB\BSON\ObjectID Object ( [oid] => 5941ae85aabac9d1d16c63a2 )
            [1] => MongoDB\BSON\ObjectID Object ( [oid] => 5941ae85aabac9d1d16c63a3 )
        )
    [isAcknowledged:MongoDB\InsertManyResult:private] => 1
)
```

Again, $result->getInsertedCount() will return 2, whereas $result->getInsertedIds() will return an array with the two newly created ObjectIds:

```
array(2) {
  [0]=> object(MongoDB\BSON\ObjectID)#13 (1) { ["oid"]=> string(24) "5941ae85aabac9d1d16c63a2" }
  [1]=> object(MongoDB\BSON\ObjectID)#14 (1) { ["oid"]=> string(24) "5941ae85aabac9d1d16c63a3" }
}
```

Deleting documents is similar to inserting, but uses the deleteOne() and deleteMany() methods; an example of deleteMany() is shown here:

```php
$deleteQuery = array( "isbn" => "401" );
$deleteResult = $collection->deleteMany($deleteQuery);
print_r($deleteResult);
print($deleteResult->getDeletedCount());
```

Here is the output:

```
MongoDB\DeleteResult Object
(
    [writeResult:MongoDB\DeleteResult:private] => MongoDB\Driver\WriteResult Object
        (
            [nInserted] => 0
            [nMatched] => 0
            [nModified] => 0
            [nRemoved] => 2
            [nUpserted] => 0
            [upsertedIds] => Array ( )
            [writeErrors] => Array ( )
            [writeConcernError] =>
            [writeConcern] => MongoDB\Driver\WriteConcern Object ( )
        )
    [isAcknowledged:MongoDB\DeleteResult:private] => 1
)
2
```

In this example, we used ->getDeletedCount() to get the number of affected documents, which is printed in the last line of the output.

Bulk write

The new PHP driver supports the bulk write interface to minimize network calls to MongoDB:

```php
$manager = new MongoDB\Driver\Manager('mongodb://localhost:27017');
$bulk = new MongoDB\Driver\BulkWrite(array("ordered" => true));
$bulk->insert(array( "isbn" => "401", "name" => "MongoDB and PHP" ));
$bulk->insert(array( "isbn" => "402", "name" => "MongoDB and PHP, 2nd Edition" ));
$bulk->update(array("isbn" => "402"), array('$set' => array("price" => 15)));
$bulk->insert(array( "isbn" => "403", "name" => "MongoDB and PHP, revisited" ));
$result = $manager->executeBulkWrite('mongo_book.books', $bulk);
print_r($result);
```

The result is:

```
MongoDB\Driver\WriteResult Object
(
    [nInserted] => 3
    [nMatched] => 1
    [nModified] => 1
    [nRemoved] => 0
    [nUpserted] => 0
    [upsertedIds] => Array ( )
    [writeErrors] => Array ( )
    [writeConcernError] =>
    [writeConcern] => MongoDB\Driver\WriteConcern Object ( )
)
```

In the preceding example, we executed two inserts, one update, and a third insert in an ordered fashion. The WriteResult object reports a total of three inserted documents and one modified document. The main difference compared to the simple create/delete queries is that executeBulkWrite() is a method of the MongoDB\Driver\Manager class, which we instantiate on the first line.

Read operation

The query interface is similar to insert and delete, with the findOne() and find() methods used to retrieve the first result or all results of a query:

```php
$document = $collection->findOne( array("isbn" => "101") );
$cursor = $collection->find( array( "name" => new MongoDB\BSON\Regex("mongo", "i") ) );
```

In the second example, we use a regular expression to search for a key name whose value matches mongo (case-insensitive). Embedded documents can be queried using dot notation, as with the other languages that we examined earlier in this chapter:

```php
$cursor = $collection->find( array('meta.price' => 50) );
```

Here we query for an embedded field price inside the meta key. Similarly to Ruby and Python, in PHP we can query using comparison operators, like this:

```php
$cursor = $collection->find( array( 'price' => array('$gte' => 60) ) );
```

Querying with multiple key-value pairs is an implicit AND, whereas queries using $or, $in, $nin, or an AND ($and) combined with $or can be expressed with nested queries:

```php
$cursor = $collection->find( array( '$or' => array(
    array("price" => array( '$gte' => 60 )),
    array("price" => array( '$lte' => 20 ))
)));
```

This finds documents with price >= 60 OR price <= 20.

Update operation

Updating documents has a similar interface with the ->updateOne() or ->updateMany() method. The first parameter is the query used to find documents and the second one updates them. We can use any of the update operators explained at the end of this chapter to update in place, or specify a new document to completely replace the matched document:

```php
$result = $collection->updateOne(
    array( "isbn" => "401" ),
    array( '$set' => array( "price" => 39 ) )
);
```

We can use single quotes or double quotes for key names, but if a key uses a special operator starting with $, we need to use single quotes. We can write array( "key" => "value" ) or ["key" => "value"]; we prefer the more explicit array() notation in this book.

The ->getMatchedCount() and ->getModifiedCount() methods return the number of documents matched by the query part and the number actually modified, respectively. If the new value is the same as the existing value of a document, it will not be counted as modified.

As we saw, it is fairly easy and advantageous to use PHP as a language and tool for performing efficient CRUD operations in MongoDB. If you are interested in more on handling data effectively with MongoDB, check out the book Mastering MongoDB 3.x.
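The excerpt above uses the official PHP driver. Purely as an aside for readers who also work in Python, the same matched-versus-modified distinction is easy to observe with the pymongo driver. This is an illustration, not part of the book: it assumes a local mongod on the default port, and the database and collection names are made up.

```python
# Illustration only: matched_count vs modified_count with Python's pymongo.
# Assumes pymongo is installed and a mongod is running on localhost:27017;
# database/collection names here are made up for the example.
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
books = client["mongo_book"]["books"]

books.insert_one({"isbn": "401", "name": "MongoDB and PHP", "price": 39})

# Setting the price to the value it already has: the document is matched,
# but nothing is actually modified.
r1 = books.update_one({"isbn": "401"}, {"$set": {"price": 39}})
print(r1.matched_count, r1.modified_count)   # 1 0

# Setting a genuinely new value: matched and modified.
r2 = books.update_one({"isbn": "401"}, {"$set": {"price": 45}})
print(r2.matched_count, r2.modified_count)   # 1 1
```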

Implementing Apache Spark K-Means Clustering method on digital breath test data for road safety

Savia Lobo
01 Mar 2018
7 min read
Note: This article is an excerpt from the book Mastering Apache Spark 2.x - Second Edition by Romeo Kienzler. In this book, you will learn to use Spark as a big data operating system, understand how to implement advanced analytics on the new APIs, and explore how easy it is to use Spark in day-to-day tasks.

In this tutorial, we use the road safety test data from our previous article to show how you can look for clusters in data using the K-Means algorithm with Apache Spark MLlib.

Theory on clustering

The K-Means algorithm iteratively attempts to determine clusters within the test data by minimizing the distance between the mean value of the cluster center vectors and the candidate cluster member vectors. For dataset members X1 to Xn partitioned into K cluster sets S1 to Sk, where K <= n, the objective is to find the assignment that minimizes the within-cluster sum of squared distances:

    argmin_S  Σ_{i=1..K} Σ_{x ∈ S_i} ||x − μ_i||²    (μ_i is the mean of the points in S_i)

K-Means in practice

The K-Means MLlib functionality works on vectors of numeric values, so it needs numeric input data. As the same data from the last section is being reused, we will not explain the data conversion again. The only change in data terms in this section is that processing in HDFS will now take place under the /data/spark/kmeans/ directory. Additionally, the conversion Scala script for the K-Means example produces records that are entirely comma-separated.

The development and processing for the K-Means example has taken place under the /home/hadoop/spark/kmeans directory, to keep the work separate from other development. The sbt configuration file is now called kmeans.sbt and is identical to the last example, except for the project name:

```scala
name := "K-Means"
```

The code for this section can be found in the software package under chapter7/K-Means. So, looking at the code for kmeans1.scala, which is stored under kmeans/src/main/scala, similar actions occur. The import statements refer to the Spark context and configuration. This time, however, the K-Means functionality is imported from MLlib, and the application class name has been changed to kmeans1:

```scala
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.apache.spark.SparkConf
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.clustering.{KMeans, KMeansModel}

object kmeans1 extends App {
```

The same actions are taken as in the last example to define the data file, define the Spark configuration, and create a Spark context:

```scala
  val hdfsServer  = "hdfs://localhost:8020"
  val hdfsPath    = "/data/spark/kmeans/"
  val dataFile    = hdfsServer + hdfsPath + "DigitalBreathTestData2013-MALE2a.csv"
  val sparkMaster = "spark://localhost:7077"
  val appName     = "K-Means 1"
  val conf        = new SparkConf()
  conf.setMaster(sparkMaster)
  conf.setAppName(appName)
  val sparkCxt = new SparkContext(conf)
```

Next, the CSV data is loaded from the data file and split by comma characters into the VectorData variable:

```scala
  val csvData = sparkCxt.textFile(dataFile)
  val VectorData = csvData.map { csvLine =>
    Vectors.dense(csvLine.split(',').map(_.toDouble))
  }
```

A KMeans object is initialized, and the parameters are set to define the number of clusters and the maximum number of iterations to determine them:

```scala
  val kMeans        = new KMeans
  val numClusters   = 3
  val maxIterations = 50
```

Some default values are defined for the initialization mode, number of runs, and epsilon, which are needed for the K-Means call but were not varied for the processing. Finally, these parameters are set against the KMeans object:

```scala
  val initializationMode = KMeans.K_MEANS_PARALLEL
  val numRuns            = 1
  val numEpsilon         = 1e-4

  kMeans.setK( numClusters )
  kMeans.setMaxIterations( maxIterations )
  kMeans.setInitializationMode( initializationMode )
  kMeans.setRuns( numRuns )
  kMeans.setEpsilon( numEpsilon )
```

We cache the training vector data to improve performance, and train the KMeans object using the vector data to create a trained K-Means model:

```scala
  VectorData.cache
  val kMeansModel = kMeans.run( VectorData )
```

We compute the K-Means cost and the number of input data rows, and output the results via println statements. The cost value indicates how tightly the clusters are packed and how well separated they are:

```scala
  val kMeansCost = kMeansModel.computeCost( VectorData )
  println( "Input data rows : " + VectorData.count() )
  println( "K-Means Cost    : " + kMeansCost )
```

Next, we use the K-Means model to print the cluster centers as vectors for each of the three clusters that were computed:

```scala
  kMeansModel.clusterCenters.foreach{ println }
```

Finally, we use the K-Means model's predict function to create a list of cluster membership predictions. We then count these predictions by value to give a count of the data points in each cluster. This shows which clusters are bigger and whether there really are three clusters:

```scala
  val clusterRddInt = kMeansModel.predict( VectorData )
  val clusterCount  = clusterRddInt.countByValue
  clusterCount.toList.foreach{ println }

} // end object kmeans1
```

In order to run this application, it must be compiled and packaged from the kmeans subdirectory, as the Linux pwd command shows here:

```
[hadoop@hc2nn kmeans]$ pwd
/home/hadoop/spark/kmeans
[hadoop@hc2nn kmeans]$ sbt package
Loading /usr/share/sbt/bin/sbt-launch-lib.bash
[info] Set current project to K-Means (in build file:/home/hadoop/spark/kmeans/)
[info] Compiling 2 Scala sources to /home/hadoop/spark/kmeans/target/scala-2.10/classes...
[info] Packaging /home/hadoop/spark/kmeans/target/scala-2.10/k-means_2.10-1.0.jar ...
[info] Done packaging.
[success] Total time: 20 s, completed Feb 19, 2015 5:02:07 PM
```

Once the packaging is successful, we check HDFS to ensure that the test data is ready. As in the last example, we convert our data to numeric form using the convert.scala file provided in the software package. We will process the DigitalBreathTestData2013-MALE2a.csv data file in the HDFS directory /data/spark/kmeans, as follows:

```
[hadoop@hc2nn nbayes]$ hdfs dfs -ls /data/spark/kmeans
Found 3 items
-rw-r--r--   3 hadoop supergroup 24645166 2015-02-05 21:11 /data/spark/kmeans/DigitalBreathTestData2013-MALE2.csv
-rw-r--r--   3 hadoop supergroup  5694226 2015-02-05 21:48 /data/spark/kmeans/DigitalBreathTestData2013-MALE2a.csv
drwxr-xr-x   - hadoop supergroup        0 2015-02-05 21:46 /data/spark/kmeans/result
```

The spark-submit tool is used to run the K-Means application. The only change in this command is that the class is now kmeans1:

```
spark-submit \
  --class kmeans1 \
  --master spark://localhost:7077 \
  --executor-memory 700M \
  --total-executor-cores 100 \
  /home/hadoop/spark/kmeans/target/scala-2.10/k-means_2.10-1.0.jar
```

The output from the Spark cluster run is as follows:

```
Input data rows : 467054
K-Means Cost    : 5.40312223450789E7
```

The previous output shows the input data volume, which looks correct; it also shows the K-Means cost value. The cost is based on the Within Set Sum of Squared Errors (WSSSE), which measures how well the computed cluster centroids match the distribution of the data points: the better the match, the lower the cost. The following link explains WSSSE and how to find a good value for k in more detail: https://datasciencelab.wordpress.com/2013/12/27/finding-the-k-in-k-means-clustering/.

Next come the three vectors, which describe the data cluster centers with the correct number of dimensions. Remember that these cluster centroid vectors will have the same number of columns as the original vector data:

```
[0.24698249738061878,1.3015883142472253,0.005830116872250263,2.9173747788555207,1.156645130895448,3.4400290524342454]
[0.3321793984152627,1.784137241326256,0.007615970459266097,2.5831987075928917,119.58366028156011,3.8379106085083468]
[0.25247226760684494,1.702510963969387,0.006384899819416975,2.231404248000688,52.202897927594805,3.551509158139135]
```

Finally, cluster membership is given for clusters 1 to 3, with cluster 1 (index 0) having the largest membership at 407,539 member vectors:

```
(0,407539)
(1,12999)
(2,46516)
```

To summarize, we walked through a practical example of how the K-Means algorithm can be used to cluster data with Apache Spark. If you found this post useful, do check out the book Mastering Apache Spark 2.x - Second Edition to learn about the latest enhancements in Apache Spark 2.x, such as interactive querying of live data and unifying DataFrames and Datasets.
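The book's example is written in Scala. Purely as an illustration for Python readers, roughly the same flow using the RDD-based MLlib API in PySpark might look like the sketch below. It is not from the book; it assumes a reasonably recent Spark installation, reuses the same placeholder HDFS path and master URL as above, and expects the CSV to be purely numeric.

```python
# A rough PySpark equivalent of the Scala K-Means flow above (a sketch, not
# from the book). Uses the RDD-based MLlib API; the HDFS path and master URL
# are placeholders, and the CSV is assumed to contain only numeric fields.
from pyspark import SparkConf, SparkContext
from pyspark.mllib.clustering import KMeans
from pyspark.mllib.linalg import Vectors

conf = SparkConf().setMaster("spark://localhost:7077").setAppName("K-Means 1 (py)")
sc = SparkContext(conf=conf)

data_file = "hdfs://localhost:8020/data/spark/kmeans/DigitalBreathTestData2013-MALE2a.csv"
vector_data = sc.textFile(data_file).map(
    lambda line: Vectors.dense([float(x) for x in line.split(",")]))
vector_data.cache()

# k and maxIterations mirror the Scala example; "k-means||" is the parallel
# initialisation mode (K_MEANS_PARALLEL in the Scala code).
model = KMeans.train(vector_data, k=3, maxIterations=50,
                     initializationMode="k-means||", epsilon=1e-4)

print("Input data rows :", vector_data.count())
print("K-Means Cost    :", model.computeCost(vector_data))
for center in model.clusterCenters:
    print(center)
print(model.predict(vector_data).countByValue())
```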

Getting Started with Data Storytelling

Aaron Lazar
28 Jan 2018
11 min read
Note: This article has been taken from the book Principles of Data Science, written by Sinan Ozdemir. It aims to practically introduce you to the different ways in which you can communicate or visualize your data to tell stories effectively.

Communication matters

Being able to conduct experiments and manipulate data in a coding language is not enough to conduct practical and applied data science, because data science is generally only as good as how it is used in practice. For instance, a medical data scientist might be able to predict the chance of a tourist contracting malaria in developing countries with >98% accuracy; however, if these results are published in a poorly marketed journal and online mentions of the study are minimal, their groundbreaking results, which could potentially prevent deaths, would never see the light of day. For this reason, communication of results through data storytelling is arguably as important as the results themselves.

A famous example of poorly managed distribution of results is the case of Gregor Mendel. Mendel is widely recognized as one of the founders of modern genetics, yet his results (including data and charts) were not well adopted until after his death. Mendel even sent them to Charles Darwin, who largely ignored Mendel's papers, which were published in little-known Moravian journals.

Generally, there are two ways of presenting results: verbal and visual. Of course, both the verbal and visual forms of communication can be broken down into dozens of subcategories, including slide decks, charts, journal papers, and even university lectures. However, we can find common elements of data presentation that can make anyone in the field more aware and effective in their communication skills. Let's dive right into effective (and ineffective) forms of communication, starting with visuals. We'll look at five basic types of graphs: scatter plots, line graphs, bar charts, histograms, and box plots.

Scatter plots

A scatter plot is probably one of the simplest graphs to create. It is made by creating two quantitative axes and using data points to represent observations. The main goal of a scatter plot is to highlight relationships between two variables and, if possible, reveal a correlation.

For example, we can look at two variables: the average hours of TV watched in a day and work performance on a 0-100 scale (0 being very poor performance and 100 being excellent performance). The goal here is to find a relationship (if it exists) between watching TV and average work performance. The following code simulates a survey of a few people, in which they revealed the amount of television they watched, on average, in a day against a company-standard work performance metric:

```python
import pandas as pd

hours_tv_watched = [0, 0, 0, 1, 1.3, 1.4, 2, 2.1, 2.6, 3.2, 4.1, 4.4, 4.4, 5]
```

This line of code creates 14 sample survey results of people answering the question of how many hours of TV they watch in a day.

```python
work_performance = [87, 89, 92, 90, 82, 80, 77, 80, 76, 85, 80, 75, 73, 72]
```

This line of code creates 14 new sample survey results of the same people being rated on their work performance on a scale from 0 to 100. For example, the first person watched 0 hours of TV a day and was rated 87/100 on their work, while the last person watched, on average, 5 hours of TV a day and was rated 72/100:

```python
df = pd.DataFrame({'hours_tv_watched': hours_tv_watched,
                   'work_performance': work_performance})
```

Here, we are creating a DataFrame in order to ease our exploratory data analysis and make it easier to produce a scatter plot:

```python
df.plot(x='hours_tv_watched', y='work_performance', kind='scatter')
```

Now we are actually making our scatter plot. In the resulting plot, the axes represent the number of hours of TV watched in a day and the person's work performance metric. Each point on a scatter plot represents a single observation (in this case a person), and its location is the result of where the observation stands on each variable.

This scatter plot does seem to show a relationship, which implies that as we watch more TV in the day, it seems to affect our work performance. Of course, as we are now experts in statistics from the last two chapters, we know that this might not be causal. A scatter plot may only work to reveal a correlation or an association between two variables, not a causation. Advanced statistical tests, such as the ones we saw in Chapter 8, Advanced Statistics, might work to reveal causation. Later on in this chapter, we will see the damaging effects that trusting a correlation might have.

Line graphs

Line graphs are perhaps one of the most widely used graphs in data communication. A line graph simply uses lines to connect data points and usually represents time on the x axis. Line graphs are a popular way to show changes in variables over time. The line graph, like the scatter plot, is used to plot quantitative variables.

As a great example, many of us wonder about the possible links between what we see on TV and our behavior in the world. A friend of mine once took this thought to an extreme: he wondered if he could find a relationship between the TV show The X-Files and the number of UFO sightings in the U.S. He found the number of UFO sightings per year and plotted them over time, then added a quick graphic to ensure that readers would be able to identify the point in time when The X-Files premiered.

It appears clear that right after 1993, the year of The X-Files premiere, the number of UFO sightings started to climb drastically. This graphic, albeit light-hearted, is an excellent example of a simple line graph: we are told what each axis measures, we can quickly see a general trend in the data, and we can identify the author's intent, which is to show a relationship between the number of UFO sightings and The X-Files premiere.

On the other hand, the following is a less impressive line chart. It attempts to highlight the change in the price of gas by plotting three points in time. At first glance, it is not much different from the previous graph: we have time on the bottom x axis and a quantitative value on the vertical y axis. The (not so) subtle difference here is that the three points are equally spaced out on the x axis; however, if we read their actual time indications, they are not equally spaced out in time. A year separates the first two points, whereas a mere 7 days separates the last two points.

Bar charts

We generally turn to bar charts when trying to compare variables across different groups. For example, we can plot the number of countries per continent using a bar chart. Note how the x axis does not represent a quantitative variable; in fact, when using a bar chart, the x axis is generally a categorical variable, while the y axis is quantitative.

Note that, for this code, I am using the World Health Organization's report on alcohol consumption around the world by country:

```python
import matplotlib.pyplot as plt  # needed for the axis labels below

drinks = pd.read_csv('data/drinks.csv')
drinks.continent.value_counts().plot(kind='bar', title='Countries per Continent')
plt.xlabel('Continent')
plt.ylabel('Count')
```

The resulting graph shows us a count of the number of countries in each continent. We can see the continent code at the bottom of the bars, and the bar height represents the number of countries we have in each continent. For example, we see that Africa has the most countries represented in our survey, while South America has the least.

In addition to the count of countries, we can also plot the average beer servings per continent using a bar chart, as shown:

```python
drinks.groupby('continent').beer_servings.mean().plot(kind='bar')
```

Note how a scatter plot or a line graph would not be able to support this data because they can only handle quantitative variables; bar charts have the ability to demonstrate categorical values. We can also use bar charts to graph variables that change over time, like a line graph.

Histograms

Histograms show the frequency distribution of a single quantitative variable by splitting up the data, by range, into equidistant bins and plotting the raw count of observations in each bin. A histogram is effectively a bar chart where the x axis is a bin (subrange) of values and the y axis is a count. As an example, I will import a store's daily number of unique customers, as shown:

```python
rossmann_sales = pd.read_csv('data/rossmann.csv')
rossmann_sales.head()
```

Note how we have data for multiple stores (see the first Store column). Let's subset this data for only the first store, as shown:

```python
first_rossmann_sales = rossmann_sales[rossmann_sales['Store'] == 1]
```

Now, let's plot a histogram of the first store's customer count:

```python
first_rossmann_sales['Customers'].hist(bins=20)
plt.xlabel('Customer Bins')
plt.ylabel('Count')
```

The x axis is now categorical in the sense that each category is a selected range of values; for example, 600-620 customers would potentially be a category. The y axis, like a bar chart, plots the number of observations in each category. In this graph, for example, one might take away the fact that most of the time, the number of customers on any given day falls between 500 and 700. Altogether, histograms are used to visualize the distribution of values that a quantitative variable can take on.

Box plots

Box plots are also used to show a distribution of values. They are created by plotting the five-number summary, as follows:

- The minimum value
- The first quartile (the number that separates the lowest 25% of values from the rest)
- The median
- The third quartile (the number that separates the highest 25% of values from the rest)
- The maximum value

In Pandas, when we create box plots, the red line denotes the median, the top of the box (or the right edge if it is horizontal) is the third quartile, and the bottom (or left) edge of the box is the first quartile. The following creates a series of box plots showing the distribution of beer consumption by continent:

```python
drinks.boxplot(column='beer_servings', by='continent')
```

Now we can clearly see the distribution of beer consumption across the continents and how they differ. Africa and Asia have a much lower median beer consumption than Europe or North America.

Box plots also have the added bonus of showing outliers much better than a histogram, because the minimum and maximum are part of the box plot. Getting back to the customer data, let's look at the same store's customer numbers, but using a box plot:

```python
first_rossmann_sales.boxplot(column='Customers', vert=False)
```

This is the exact same data as plotted earlier in the histogram; however, now it is shown as a box plot, and the book shows the two charts side by side for comparison. Note how the x axis for each graph is the same, ranging from 0 to 1,200. The box plot is much quicker at giving us the center of the data (the red line is the median), while the histogram works much better at showing how spread out the data is and where the biggest bins are. For example, the histogram reveals that there is a very large bin at zero people, meaning that for a little over 150 days of data, there were zero customers.

Note that we can get the exact numbers needed to construct a box plot using the describe feature in Pandas, as shown:

```python
first_rossmann_sales['Customers'].describe()
```

```
min       0.000000
25%     463.000000
50%     529.000000
75%     598.750000
max    1130.000000
```

There we have it! We just covered data storytelling through techniques such as scatter plots, line graphs, bar charts, histograms, and box plots. Now you have the tools to be creative in the way you tell the tales of your data. If you found our article useful, you can check out Principles of Data Science for more interesting data science tips and techniques.
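As a small follow-up to the scatter plot discussion above, the association behind a scatter plot can also be put into a single number. The sketch below is not from the book; it reuses the TV/work-performance lists from the example and assumes pandas and matplotlib are installed. Remember that a strong correlation still says nothing about causation.

```python
# Quantifying the association behind the scatter plot above (a sketch, not
# from the book). Assumes pandas and matplotlib are installed.
import pandas as pd
import matplotlib.pyplot as plt

hours_tv_watched = [0, 0, 0, 1, 1.3, 1.4, 2, 2.1, 2.6, 3.2, 4.1, 4.4, 4.4, 5]
work_performance = [87, 89, 92, 90, 82, 80, 77, 80, 76, 85, 80, 75, 73, 72]

df = pd.DataFrame({'hours_tv_watched': hours_tv_watched,
                   'work_performance': work_performance})

# Pearson correlation: a value near -1 indicates a strong negative linear
# association, but it does not establish causation.
print(df.corr())

# The five-number summary used by the box plot can be read off describe().
print(df['work_performance'].describe())

df.plot(x='hours_tv_watched', y='work_performance', kind='scatter')
plt.show()
```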

Hands on Table Calculation Techniques with Tableau

Amarabha Banerjee
14 Feb 2018
4 min read
Note: This article is a book excerpt from Mastering Tableau, written by David Baldwin. The book will help you master the intricacies of Tableau to create effective data visualizations.

Today we explore table calculation techniques with Tableau and work through a real-world example of using them. This article provides a simple schema for understanding table calculations, communicated via two questions:

- What is the function?
- How is the function applied?

These two questions are inexorably connected. You cannot reliably apply something until you know what it is, and you cannot get useful results from something until you correctly apply it. The sections below will give you a head start into table calculation techniques and how to use Tableau functions effectively when implementing table calculations.

Basics of table calculations

Calculated fields can be categorized as row level, aggregate level, and table level. For row- and aggregate-level calculations, the underlying data source engine does most (if not all) of the computational work, and Tableau merely visualizes the results. For table calculations, Tableau also relies on the underlying data source engine to execute computational tasks; however, after that work is completed and a dataset is returned, Tableau performs additional processing before rendering the results. This is the stage, circled as "Tableau performs additional processing" in the book's process flow diagram, where table calculations are processed.

We will continue with a definition: a table calculation is a function performed on a dataset in cache that has been generated as a result of a query from Tableau to the data source.

Let's consider a couple of points regarding the dataset in cache mentioned in the preceding definition. First, it is important to understand that this cache is not simply the returned results of a query; Tableau may adjust the returned results. For example, Tableau may expand the cache via data densification. Second, it's important to consider how the cache is structured. Basically, the dataset in cache is a table and, like all tables, is made up of rows and columns. This is particularly important for table calculations, since a table calculation may be computed as it moves along the cache. Such a table calculation is directional. Alternatively, a table calculation may be computed based on the entire cache with no directional consideration. Table calculations such as this are non-directional. Directional and non-directional table calculations are considered more fully in the following section.

Directional and non-directional table calculation functions

As of Tableau 10, there are 32 table calculation functions in Tableau. However, many of these are simply variations on a theme; for example, there are five Running functions, including RUNNING_SUM and RUNNING_AVG. If we narrow our consideration to unique groups of table calculation functions, we discover that there are only 11, which the book organizes into two categories: directional and non-directional. As mentioned previously, non-directional table calculation functions operate on the entire cache and thus are not computed based on movement through the cache. For example, the SIZE function doesn't change based on the value of a previous row in the cache. On the other hand, RUNNING_SUM does change based on previous rows in the cache and is therefore considered directional.
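Tableau table calculations are not written in Python, but if you are used to pandas, the following loose analogy (purely illustrative, not Tableau syntax and not from the book) may help make the directional versus non-directional distinction concrete.

```python
# Purely an analogy, not Tableau syntax: directional calculations depend on
# the order of rows in the cached table, non-directional ones do not.
import pandas as pd

cache = pd.DataFrame({'sales': [10, 20, 30, 40]})

# Directional (like RUNNING_SUM): each row's value depends on the rows that
# came before it, so row order matters.
cache['running_sum'] = cache['sales'].cumsum()

# Non-directional (like SIZE or a WINDOW_SUM over the whole partition): the
# value is computed from the entire table and is the same for every row.
cache['size'] = len(cache)
cache['window_sum'] = cache['sales'].sum()

print(cache)
```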
Practical example

In this example, we'll see directional and non-directional table calculation functions in action:

1. Navigate to https://public.tableau.com/profile/david.baldwin#!/ to locate and download the workbook associated with this chapter.
2. Navigate to the worksheet entitled Directional/Non-Directional.
3. Create the calculated fields used in this exercise: Lookup, Size, Window Sum, Window Sum w/Start&End, and Running Sum (their definitions are shown in the book).
4. Place Category and Ship Mode on the Rows shelf.
5. Double-click on Sales, Lookup, Size, Window Sum, Window Sum w/Start&End, and Running Sum to populate the view.
6. Compare the resulting view with the notes in step 3 of this exercise for better understanding.

We discussed techniques for implementing table calculations with Tableau. If you liked our post, be sure to check out Mastering Tableau, which covers other useful data visualization and data analysis techniques.

Support Vector Machines as a Classification Engine

Packt
17 Mar 2016
9 min read
In this article by Tomasz Drabas, author of the book Practical Data Analysis Cookbook, we discuss how Support Vector Machine models can be used as a classification engine.

Support Vector Machines

Support Vector Machines (SVMs) are a family of extremely powerful models that can be used in classification and regression problems. They aim at finding decision boundaries that separate observations with differing class memberships. While many classifiers exist that can classify linearly separable data (for example, logistic regression), SVMs can handle highly non-linear problems using a kernel trick that implicitly maps the input vectors to higher-dimensional feature spaces. The transformation rearranges the dataset in such a way that it is then linearly solvable.

The mechanics of the machine

Given a set of n points of the form (x1, y1), ..., (xn, yn), where xi is a z-dimensional input vector and yi is a class label, the SVM aims at finding the maximum-margin hyperplane that separates the data points. In a two-dimensional dataset with linearly separable data points, the maximum-margin hyperplane would be a line that maximizes the distance between the two classes. The hyperplane can be expressed as a dot product of the input vectors x and a vector W normal to the hyperplane: W·X = b, where b is the offset from the origin of the coordinate system. To find the hyperplane, we solve the standard hard-margin optimization problem of minimizing ||W||²/2 subject to yi(W·xi − b) >= 1 for every data point. The constraint effectively states that no point can cross the hyperplane if it does not belong to the class on that side of the hyperplane.

Linear SVM

Building a linear SVM classifier in Python is easy. There are multiple Python packages that can estimate a linear SVM, but here we decided to use MLPY (http://mlpy.sourceforge.net):

```python
import pandas as pd
import numpy as np
import mlpy as ml
```

First, we load the necessary modules that we will use later, namely pandas (http://pandas.pydata.org), NumPy (http://www.numpy.org), and the aforementioned MLPY. We use pandas to read the data (see the https://github.com/drabastomek/practicalDataAnalysisCookbook repository to download the data):

```python
# the file name of the dataset
r_filename = 'Data/Chapter03/bank_contacts.csv'

# read the data
csv_read = pd.read_csv(r_filename)
```

The dataset that we use was described in S. Moro, P. Cortez, and P. Rita, A Data-Driven Approach to Predict the Success of Bank Telemarketing, Decision Support Systems, Elsevier, 62:22-31, June 2014, and can be found at http://archive.ics.uci.edu/ml/datasets/Bank+Marketing. It consists of over 41.1k outbound marketing calls of a bank. Our aim is to classify these calls into two buckets: those that resulted in a credit application and those that did not.

Once the file is loaded, we split the data into training and testing datasets; we also keep the input and class indicator data separately. To this end, we use the split_data(...) method:

```python
def split_data(data, y, x = 'All', test_size = 0.33):
    '''
        Method to split the data into training and testing
    '''
    import sys

    # dependent variable
    variables = {'y': y}

    # and all the independent
    if x == 'All':
        allColumns = list(data.columns)
        allColumns.remove(y)
        variables['x'] = allColumns
    else:
        if type(x) != list:
            print('The x parameter has to be a list...')
            sys.exit(1)
        else:
            variables['x'] = x

    # create a variable to flag the training sample
    data['train'] = np.random.rand(len(data)) < (1 - test_size)

    # split the data into training and testing
    train_x = data[data.train][variables['x']]
    train_y = data[data.train][variables['y']]
    test_x  = data[~data.train][variables['x']]
    test_y  = data[~data.train][variables['y']]

    return train_x, train_y, test_x, test_y, variables['x']
```

We randomly set 1/3 of the dataset aside for testing purposes and use the remaining 2/3 for training the model:

```python
# split the data into training and testing
train_x, train_y, test_x, test_y, labels = hlp.split_data(
    csv_read,
    y = 'credit_application'
)
```

Once we read the data and split it into training and testing datasets, we can estimate the model:

```python
# create the classifier object
svm = ml.LibSvm(svm_type='c_svc', kernel_type='linear', C=100.0)

# fit the data
svm.learn(train_x, train_y)
```

The svm_type parameter of the .LibSvm(...) method controls which algorithm to use to estimate the SVM. Here, we use the c_svc method, a C-support Vector Classifier. The C parameter specifies how much you want to avoid misclassifying observations: larger values of C will shrink the margin of the hyperplane so that more of the observations are correctly classified. You can also specify nu_svc with a nu parameter that controls how much of your sample (at most) can be misclassified and how many of your observations (at least) can become support vectors.

Here, we estimate an SVM with a linear kernel, so let's talk about kernels.

Kernels

A kernel function K is effectively a function that computes a dot product between two n-dimensional vectors, K: R^n x R^n -> R. In other words, the kernel function takes two vectors and produces a scalar. The linear kernel does not effectively transform the data into a higher-dimensional space. This is not true for polynomial or Radial Basis Function (RBF) kernels, which transform the input feature space into higher dimensions. In the case of a polynomial kernel of degree d, the obtained feature space has C(n+d, d) (that is, (n+d)-choose-d) dimensions for an n-dimensional input feature space. As you can see, the number of additional dimensions can grow very quickly, and this would pose significant problems in estimating the model if we had to explicitly transform the data into higher dimensions. Thankfully, we do not have to, and that's where the kernel trick comes into play. SVMs do not have to work explicitly in higher dimensions; they can implicitly map the data to higher dimensions using pairwise inner products (instead of an explicit dot product) and then use them to find the maximum-margin hyperplane. You can find a really good explanation of the kernel trick at http://www.eric-kim.net/eric-kim-net/posts/1/kernel_trick.html.

Back to our example

The .learn(...) method of the .LibSvm(...) object estimates the model. Once the model is estimated, we can test how well it performs. First, we use the estimated model to predict the classes for the observations in the testing dataset:

```python
predicted_l = svm.pred(test_x)
```

Next, we will use some of the scikit-learn methods to print the basic statistics for our model:

```python
def printModelSummary(actual, predicted):
    '''
        Method to print out model summaries
    '''
    import sklearn.metrics as mt

    print('Overall accuracy of the model is {0:.2f} percent'
        .format((actual == predicted).sum() / len(actual) * 100))
    print('Classification report: \n',
        mt.classification_report(actual, predicted))
    print('Confusion matrix: \n',
        mt.confusion_matrix(actual, predicted))
    print('ROC: ', mt.roc_auc_score(actual, predicted))
```

First, we calculate the overall accuracy of the model, expressed as the ratio of properly classified observations to the total number of observations in the testing sample. Next, we print the classification report:

- Precision is the model's ability to avoid classifying an observation as positive when it is not. It is the ratio of true positives to the overall number of positively classified records. The overall precision score is a weighted average of the individual precision scores, where the weight is the support; the support is the total number of actual observations in each class. The total precision for our model is not too bad, 89 out of 100. However, when we look at the precision for the positive class, the situation is not as good: only 63 out of 100 were properly classified.
- Recall can be viewed as the model's capacity to find all the positive samples. It is the ratio of true positives to the sum of true positives and false negatives. The recall for the class 0.0 is almost perfect, but for class 1.0 it looks really bad. This might be a problem caused by the fact that our sample is not balanced, but it is more likely that the features we use to classify the data do not really capture the differences between the two groups.
- The f1-score is effectively a weighted amalgam of precision and recall: it is the ratio of twice the product of precision and recall to their sum. In one measure, it shows whether the model performs well or not.

At the general level, the model does not perform badly, but when we look at its ability to classify the true signal, it fails gravely. This is a perfect example of why judging a model at the general level can be misleading when dealing with heavily unbalanced samples.

RBF kernel SVM

Given that the linear kernel performed poorly, our dataset might not be linearly separable. Thus, let's try the RBF kernel. The RBF kernel is given as K(x, y) = exp(−||x − y||² / (2σ²)), where ||x − y||² is the squared Euclidean distance between the two vectors x and y, and σ is a free parameter. The value of the RBF kernel equals 1 when x = y and gradually falls to 0 as the distance approaches infinity. To fit an RBF version of our model, we can specify our svm object as follows:

```python
svm = ml.LibSvm(svm_type='c_svc', kernel_type='rbf', gamma=0.1, C=1.0)
```

The gamma parameter specifies how far the influence of a single support vector reaches. You can visually investigate the relationship between the gamma and C parameters at http://scikit-learn.org/stable/auto_examples/svm/plot_rbf_parameters.html. The rest of the code for the model estimation follows in a similar fashion as with the linear kernel, and the results are even worse: precision and recall drop across the board. The SVM with the RBF kernel performed worse at separating the calls that resulted in applying for the credit card from those that did not.

Summary

In this article, we saw that the problem is not with the model; rather, the dataset that we use does not explain the variance sufficiently. This requires going back to the drawing board and selecting other features.
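As an aside, MLPY is no longer widely maintained; if you prefer scikit-learn, roughly the same linear and RBF experiments could be sketched as follows. This is illustrative only, not the book's code: it reuses the CSV path and target column from the example above, assumes the preprocessed, all-numeric bank_contacts.csv from the book's repository, and uses class_weight='balanced' as one possible way to address the class imbalance discussed in the text.

```python
# A scikit-learn sketch of the same experiments (not the book's code). Assumes
# the preprocessed, all-numeric bank_contacts.csv from the book's repository,
# with 'credit_application' as the binary target column.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import classification_report, confusion_matrix, roc_auc_score

data = pd.read_csv('Data/Chapter03/bank_contacts.csv')
X = data.drop(columns=['credit_application'])
y = data['credit_application']

# Same 1/3 test split as in the example, but reproducible via random_state.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33,
                                                    random_state=42)

models = [('linear', SVC(kernel='linear', C=100.0, class_weight='balanced')),
          ('rbf',    SVC(kernel='rbf', gamma=0.1, C=1.0, class_weight='balanced'))]

for name, model in models:
    model.fit(X_train, y_train)
    predicted = model.predict(X_test)
    print('---', name, 'kernel ---')
    print(classification_report(y_test, predicted))
    print(confusion_matrix(y_test, predicted))
    print('ROC AUC:', roc_auc_score(y_test, predicted))
```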

Managing a Hadoop Cluster

Packt
30 Aug 2013
13 min read
(For more resources related to this topic, see here.) From the perspective of functionality, a Hadoop cluster is composed of an HDFS cluster and a MapReduce cluster . The HDFS cluster consists of the default filesystem for Hadoop. It has one or more NameNodes to keep track of the filesystem metadata, while actual data blocks are stored on distributed slave nodes managed by DataNode. Similarly, a MapReduce cluster has one JobTracker daemon on the master node and a number of TaskTrackers on the slave nodes. The JobTracker manages the life cycle of MapReduce jobs. It splits jobs into smaller tasks and schedules the tasks to run by the TaskTrackers. A TaskTracker executes tasks assigned by the JobTracker in parallel by forking one or a number of JVM processes. As a Hadoop cluster administrator, you will be responsible for managing both the HDFS cluster and the MapReduce cluster. In general, system administrators should maintain the health and availability of the cluster. More specifically, for an HDFS cluster, it means the management of the NameNodes and DataNodes and the management of the JobTrackers and TaskTrackers for MapReduce. Other administrative tasks include the management of Hadoop jobs, for example configuring job scheduling policy with schedulers. Managing the HDFS cluster The health of HDFS is critical for a Hadoop-based Big Data platform. HDFS problems can negatively affect the efficiency of the cluster. Even worse, it can make the cluster not function properly. For example, DataNode's unavailability caused by network segmentation can lead to some under-replicated data blocks. When this happens, HDFS will automatically replicate those data blocks, which will bring a lot of overhead to the cluster and cause the cluster to be too unstable to be available for use. In this recipe, we will show commands to manage an HDFS cluster. Getting ready Before getting started, we assume that our Hadoop cluster has been properly configured and all the daemons are running without any problems. Log in to the master node from the administrator machine with the following command: ssh hduser@master How to do it... Use the following steps to check the status of an HDFS cluster with hadoop fsck: Check the status of the root filesystem with the following command: hadoop fsck / We will get an output similar to the following: FSCK started by hduser from /10.147.166.55 for path / at Thu Feb 28 17:14:11 EST 2013 .. /user/hduser/.staging/job_201302281211_0002/job.jar: Under replicated blk_-665238265064328579_1016. Target Replicas is 10 but found 5 replica(s). .................................Status: HEALTHY Total size: 14420321969 B Total dirs: 22 Total files: 35 Total blocks (validated): 241 (avg. block size 59835360 B) Minimally replicated blocks: 241 (100.0 %) Over-replicated blocks: 0 (0.0 %) Under-replicated blocks: 2 (0.8298755 %) Mis-replicated blocks: 0 (0.0 %) Default replication factor: 2 Average block replication: 2.0248964 Corrupt blocks: 0 Missing replicas: 10 (2.0491803 %) Number of data-nodes: 5 Number of racks: 1 FSCK ended at Thu Feb 28 17:14:11 EST 2013 in 28 milliseconds The filesystem under path '/' is HEALTHY The output shows that some percentage of data blocks is under-replicated. But because HDFS can automatically make duplication for those data blocks, the HDFS filesystem and the '/' directory are both HEALTHY. 
Check the status of all the files on HDFS with the following command: hadoop fsck / -files We will get an output similar to the following: FSCK started by hduser from /10.147.166.55 for path / at Thu Feb 28 17:40:35 EST 2013 / <dir> /home <dir> /home/hduser <dir> /home/hduser/hadoop <dir> /home/hduser/hadoop/tmp <dir> /home/hduser/hadoop/tmp/mapred <dir> /home/hduser/hadoop/tmp/mapred/system <dir> /home/hduser/hadoop/tmp/mapred/system/jobtracker.info 4 bytes, 1 block(s): OK /user <dir> /user/hduser <dir> /user/hduser/randtext <dir> /user/hduser/randtext/_SUCCESS 0 bytes, 0 block(s): OK /user/hduser/randtext/_logs <dir> /user/hduser/randtext/_logs/history <dir> /user/hduser/randtext/_logs/history/job_201302281451_0002_13620904 21087_hduser_random-text-writer 23995 bytes, 1 block(s): OK /user/hduser/randtext/_logs/history/job_201302281451_0002_conf.xml 22878 bytes, 1 block(s): OK /user/hduser/randtext/part-00001 1102231864 bytes, 17 block(s): OK Status: HEALTHY Hadoop will scan and list all the files in the cluster. This command scans all ? les on HDFS and prints the size and status. Check the locations of file blocks with the following command: hadoop fsck / -files -locations The output of this command will contain the following information: The first line tells us that file part-00000 has 17 blocks in total and each block has 2 replications (replication factor has been set to 2). The following lines list the location of each block on the DataNode. For example, block blk_6733127705602961004_1127 has been replicated on hosts 10.145.231.46 and 10.145.223.184. The number 50010 is the port number of the DataNode. Check the locations of file blocks containing rack information with the following command: hadoop fsck / -files -blocks -racks Delete corrupted files with the following command: hadoop fsck -delete Move corrupted files to /lost+found with the following command: hadoop fsck -move Use the following steps to check the status of an HDFS cluster with hadoop dfsadmin: Report the status of each slave node with the following command: hadoop dfsadmin -report The output will be similar to the following: Configured Capacity: 422797230080 (393.76 GB) Present Capacity: 399233617920 (371.82 GB) DFS Remaining: 388122796032 (361.47 GB) DFS Used: 11110821888 (10.35 GB) DFS Used%: 2.78% Under replicated blocks: 0 Blocks with corrupt replicas: 0 Missing blocks: 0 ------------------------------------------------- Datanodes available: 5 (5 total, 0 dead) Name: 10.145.223.184:50010 Decommission Status : Normal Configured Capacity: 84559446016 (78.75 GB) DFS Used: 2328719360 (2.17 GB) Non DFS Used: 4728565760 (4.4 GB) DFS Remaining: 77502160896(72.18 GB) DFS Used%: 2.75% DFS Remaining%: 91.65% Last contact: Thu Feb 28 20:30:11 EST 2013 ... The first section of the output shows the summary of the HDFS cluster, including the configured capacity, present capacity, remaining capacity, used space, number of under-replicated data blocks, number of data blocks with corrupted replicas, and number of missing blocks. The following sections of the output information show the status of each HDFS slave node, including the name (ip:port) of the DataNode machine, commission status, configured capacity, HDFS and non-HDFS used space amount, HDFS remaining space, and the time that the slave node contacted the master. 
Refresh all the DataNodes using the following command: hadoop dfsadmin -refreshNodes Check the status of the safe mode using the following command: hadoop dfsadmin -safemode get We will be able to get the following output: Safe mode is OFF The output tells us that the NameNode is not in safe mode. In this case, the filesystem is both readable and writable. If the NameNode is in safe mode, the filesystem will be read-only (write protected). Manually put the NameNode into safe mode using the following command: hadoop dfsadmin -safemode enter This command is useful for system maintenance. Make the NameNode to leave safe mode using the following command: hadoop dfsadmin -safemode leave If the NameNode has been in safe mode for a long time or it has been put into safe mode manually, we need to use this command to let the NameNode leave this mode. Wait until NameNode leaves safe mode using the following command: hadoop dfsadmin -safemode wait This command is useful when we want to wait until HDFS finishes data block replication or wait until a newly commissioned DataNode to be ready for service. Save the metadata of the HDFS filesystem with the following command: hadoop dfsadmin -metasave meta.log The meta.log file will be created under the directory $HADOOP_HOME/logs. Its content will be similar to the following: 21 files and directories, 88 blocks = 109 total Live Datanodes: 5 Dead Datanodes: 0 Metasave: Blocks waiting for replication: 0 Metasave: Blocks being replicated: 0 Metasave: Blocks 0 waiting deletion from 0 datanodes. Metasave: Number of datanodes: 5 10.145.223.184:50010 IN 84559446016(78.75 GB) 2328719360(2.17 GB) 2.75% 77502132224(72.18 GB) Thu Feb 28 21:43:52 EST 2013 10.152.166.137:50010 IN 84559446016(78.75 GB) 2357415936(2.2 GB) 2.79% 77492854784(72.17 GB) Thu Feb 28 21:43:52 EST 2013 10.145.231.46:50010 IN 84559446016(78.75 GB) 2048004096(1.91 GB) 2.42% 77802893312(72.46 GB) Thu Feb 28 21:43:54 EST 2013 10.152.161.43:50010 IN 84559446016(78.75 GB) 2250854400(2.1 GB) 2.66% 77600096256(72.27 GB) Thu Feb 28 21:43:52 EST 2013 10.152.175.122:50010 IN 84559446016(78.75 GB) 2125828096(1.98 GB) 2.51% 77724323840(72.39 GB) Thu Feb 28 21:43:53 EST 2013 21 files and directories, 88 blocks = 109 total ... How it works... The HDFS filesystem will be write protected when NameNode enters safe mode. When an HDFS cluster is started, it will enter safe mode first. The NameNode will check the replication factor for each data block. If the replica count of a data block is smaller than the configured value, which is 3 by default, the data block will be marked as under-replicated. Finally, an under-replication factor, which is the percentage of under-replicated data blocks, will be calculated. If the percentage number is larger than the threshold value, the NameNode will stay in safe mode until enough new replicas are created for the under-replicated data blocks so as to make the under-replication factor lower than the threshold. 
We can get the usage of the fsck command using: hadoop fsck The usage information will be similar to the following: Usage: DFSck <path> [-move | -delete | -openforwrite] [-files [-blocks [-locations | -racks]]] <path> start checking from this path -move move corrupted files to /lost+found -delete delete corrupted files -files print out files being checked -openforwrite print out files opened for write -blocks print out block report -locations print out locations for every block -racks print out network topology for data-node locations By default fsck ignores files opened for write, use -openforwrite to report such files. They are usually tagged CORRUPT or HEALTHY depending on their block allocation status.   We can get the usage of the dfsadmin command using: hadoop dfsadmin The output will be similar to the following: Usage: java DFSAdmin [-report] [-safemode enter | leave | get | wait] [-saveNamespace] [-refreshNodes] [-finalizeUpgrade] [-upgradeProgress status | details | force] [-metasave filename] [-refreshServiceAcl] [-refreshUserToGroupsMappings] [-refreshSuperUserGroupsConfiguration] [-setQuota <quota> <dirname>...<dirname>] [-clrQuota <dirname>...<dirname>] [-setSpaceQuota <quota> <dirname>...<dirname>] [-clrSpaceQuota <dirname>...<dirname>] [-setBalancerBandwidth <bandwidth in bytes per second>] [-help [cmd]] There's more… Besides using command line, we can use the web UI to check the status of an HDFS cluster. For example, we can get the status information of HDFS by opening the link http://master:50070/dfshealth.jsp. We will get a web page that shows the summary of the HDFS cluster such as the configured capacity and remaining space. For example, the web page will be similar to the following screenshot: By clicking on the Live Nodes link, we can check the status of each DataNode. We will get a web page similar to the following screenshot: By clicking on the link of each node, we can browse the directory of the HDFS filesystem. The web page will be similar to the following screenshot: The web page shows that file /user/hduser/randtext has been split into five partitions. We can browse the content of each partition by clicking on the part-0000x link. Configuring SecondaryNameNode Hadoop NameNode is a single point of failure. By configuring SecondaryNameNode, the filesystem image and edit log files can be backed up periodically. And in case of NameNode failure, the backup files can be used to recover the NameNode. In this recipe, we will outline steps to configure SecondaryNameNode. Getting ready We assume that Hadoop has been configured correctly. Log in to the master node from the cluster administration machine using the following command: ssh hduser@master How to do it... Perform the following steps to configure SecondaryNameNode: Stop the cluster using the following command: stop-all.sh Add or change the following into the file $HADOOP_HOME/conf/hdfs-site.xml: <property> <name>fs.checkpoint.dir</name> <value>/hadoop/dfs/namesecondary</value> </property> If this property is not set explicitly, the default checkpoint directory will be ${hadoop.tmp.dir}/dfs/namesecondary. 
Start the cluster using the following command: start-all.sh The tree structure of the NameNode data directory will be similar to the following: ${dfs.name.dir}/ ├── current │ ├── edits │ ├── fsimage │ ├── fstime │ └── VERSION ├── image │ └── fsimage ├── in_use.lock └── previous.checkpoint ├── edits ├── fsimage ├── fstime └── VERSION And the tree structure of the SecondaryNameNode data directory will be similar to the following: ${fs.checkpoint.dir}/ ├── current │ ├── edits │ ├── fsimage │ ├── fstime │ └── VERSION ├── image │ └── fsimage └── in_use.lock There's more... To increase redundancy, we can configure NameNode to write filesystem metadata on multiple locations. For example, we can add an NFS shared directory for backup by changing the following property in the file $HADOOP_HOME/conf/hdfs-site.xml: <property> <name>dfs.name.dir</name> <value>/hadoop/dfs/name,/nfs/name</value> </property> Managing the MapReduce cluster A typical MapReduce cluster is composed of one master node that runs the JobTracker and a number of slave nodes that run TaskTrackers. The task of managing a MapReduce cluster includes maintaining the health as well as the membership between TaskTrackers and the JobTracker. In this recipe, we will outline commands to manage a MapReduce cluster. Getting ready We assume that the Hadoop cluster has been properly configured and running. Log in to the master node from the cluster administration machine using the following command: ssh hduser@master How to do it... Perform the following steps to manage a MapReduce cluster: List all the active TaskTrackers using the following command: hadoop -job -list-active-trackers This command can help us check the registration status of the TaskTrackers in the cluster. Check the status of the JobTracker safe mode using the following command: hadoop mradmin -safemode get We will get the following output: Safe mode is OFF The output tells us that the JobTracker is not in safe mode. We can submit jobs to the cluster. If the JobTracker is in safe mode, no jobs can be submitted to the cluster. Manually let the JobTracker enter safe mode using the following command: hadoop mradmin -safemode enter This command is handy when we want to maintain the cluster. Let the JobTracker leave safe mode using the following command: hadoop mradmin -safemode leave When maintenance tasks are done, you need to run this command. If we want to wait for safe mode to exit, the following command can be used: hadoop mradmin -safemode wait Reload the MapReduce queue configuration using the following command: hadoop mradmin -refreshQueues Reload active TaskTrackers using the following command: hadoop mradmin -refreshNodes How it works... Get the usage of the mradmin command using the following: hadoop mradmin The usage information will be similar to the following: Usage: java MRAdmin [-refreshServiceAcl] [-refreshQueues] [-refreshUserToGroupsMappings] [-refreshSuperUserGroupsConfiguration] [-refreshNodes] [-safemode <enter | leave | get | wait>] [-help [cmd]] ... The meaning of the command options is listed in the following table: Option Description -refreshServiceAcl Force JobTracker to reload service ACL. -refreshQueues Force JobTracker to reload queue configurations. -refreshUserToGroupsMappings Force JobTracker to reload user group mappings. -refreshSuperUserGroupsConfiguration Force JobTracker to reload super user group mappings. -refreshNodes Force JobTracker to refresh the JobTracker hosts. -help [cmd] Show the help info for a command or all commands. 
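The mradmin safe mode commands lend themselves to the same kind of scripting. The sketch below (again illustrative rather than part of the original recipe, and assuming hadoop is on the PATH) blocks job submission while maintenance work runs, then re-enables it:

import subprocess
from contextlib import contextmanager

def mradmin(*args):
    # Run "hadoop mradmin <args>" and return its output.
    return subprocess.run(["hadoop", "mradmin", *args],
                          capture_output=True, text=True, check=True).stdout

@contextmanager
def jobtracker_maintenance():
    # No jobs can be submitted while the JobTracker is in safe mode.
    mradmin("-safemode", "enter")
    try:
        yield
    finally:
        mradmin("-safemode", "leave")

with jobtracker_maintenance():
    # For example, reload the queue configuration and the TaskTracker list.
    mradmin("-refreshQueues")
    mradmin("-refreshNodes")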
Summary
In this article, we learned how to manage the HDFS cluster, configure the SecondaryNameNode, and manage the MapReduce cluster. A Hadoop cluster administrator is responsible for both the HDFS cluster and the MapReduce cluster, and must know how to manage them in order to maintain the health and availability of the cluster. More specifically, this means managing the NameNode and DataNodes for HDFS, and the JobTracker and TaskTrackers for MapReduce, both of which are covered in this article.
Resources for Article:
Further resources on this subject: Analytics – Drawing a Frequency Distribution with MapReduce (Intermediate) [Article] Advanced Hadoop MapReduce Administration [Article] Understanding MapReduce [Article]

Installing MariaDB on Windows and Mac OS X

Packt
22 Oct 2013
5 min read
(For more resources related to this topic, see here.) Installing MariaDB on Windows There are two types of MariaDB downloads for Windows: ZIP files and MSI packages. As mentioned previously, the ZIP files are similar to the Linux binary .tar.gz files and they are only recommended for experts who know they want it. If we are starting out with MariaDB on Windows, it is recommended to use the MSI packages. Here are the steps to do just that: Download the MSI package from https://downloads.mariadb.org/. First click on the series we want (stable, most likely), then locate the Windows 64-bit or Windows 32-bit MSI package. For most computers, the 64-bit MSI package is probably the one that we want, especially if we have more than 4 Gigabytes of RAM. If you're unsure, the 32-bit package will work on both 32-bit and 64-bit computers. Once the download has finished, launch the MSI installer by double-clicking on it. Depending on our settings we may be prompted to launch it automatically. The installer will walk us through installing MariaDB. If we are installing MariaDB for the first time, we must be sure to set the root user password when prompted. Unless we need to, don't enable access from remote machines for the root user or create an anonymous account. The Install as service box is checked by default, and it's recommended to keep it that way so that MariaDB starts up when the computer is booted. The Service Name textbox has the default value MySQL for compatibility reasons, but we can rename it if we like. Check the Enable networking option, if you need to access the databases from a different computer. If we don't it's best to uncheck this box. As with the service name, there is a default TCP port number (3306) which you can change if you want to, but it is usually best to stick with the default unless there is a specific reason not to. The Optimize for transactions checkbox is checked by default. This setting can be left as is. There are other settings that we can make through the installer. All of them can be changed later by editing the my.ini file, so we don't have to worry about setting them right away. If our version of Windows has User Account Control enabled, there will be a pop-up during the installation asking if we want to allow the installer to install MariaDB. For obvious reasons, click on Yes. After the installation completes, there will be a MariaDB folder added to the start menu. Under this will be various links, including one to the MySQL Client. If we already have an older version of MariaDB or MySQL running on our machine, we will be prompted to upgrade the data files for the version we are installing, it is highly recommended that we do so. Eventually we will be presented with a dialog box with an installation complete message and a Finish button. If you got this far, congratulations! MariaDB is now installed and running on your Windows-based computer. Click on Finish to quit the installer. Installing MariaDB on Mac OS X One of the easiest ways to install MariaDB on Mac OS X is to use Homebrew, which is an Open Source package manager for that platform. Before you can install it, however, you need to prepare your system. The first thing you need to do is install Xcode; Apple's integrated development environment. It's available for free in the Mac App Store. Once Xcode is installed you can install brew. 
Full instructions are available on the Brew Project website at http://mxcl.github.io/homebrew/ but the basic procedure is to open a terminal and run the following command: ruby -e "$(curl -fsSL https://raw.github.com/mxcl/homebrew/go)" This command downloads the installer and runs it. Once the initial installation is completed, we run the following command to make sure everything is set up properly: brew doctor The output of the doctor command will tell us of any potential issues along with suggestions for how to fix them. Once brew is working properly, you can install MariaDB with the following commands: brew update brew install mariadb Unlike on Linux and Windows, brew does not automatically set up or offer to set up MariaDB to start automatically when your system boots or start MariaDB after installation. To do so, we perform the following command: ln -sfv /usr/local/opt/mariadb/*.plist ~/Library/LaunchAgents launchctl load ~/Library/LaunchAgents/homebrew.mxcl.mariadb.plist To stop MariaDB, we use the unload command as follows: launchctl unload ~/Library/LaunchAgents/homebrew.mxcl.mariadb.plist Summary In this article, we learned how to install MariaDB on Windows and Mac OS X. Resources for Article: Further resources on this subject: Ruby with MongoDB for Web Development [Article] So, what is MongoDB? [Article] Schemas and Models [Article]
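Whichever platform we installed on, it is worth confirming that the server is up and accepting connections. The following minimal Python sketch uses the PyMySQL driver, which is not covered in this article; the host, port, and empty root password are assumptions and should be adjusted to match the choices made during installation:

import pymysql

# Assumption: a local server on the default port; set password to the root
# password chosen in the Windows installer, or leave it empty for a fresh
# Homebrew install that has not been secured yet.
connection = pymysql.connect(host="127.0.0.1", port=3306,
                             user="root", password="")
try:
    with connection.cursor() as cursor:
        cursor.execute("SELECT VERSION()")
        print("Connected, server version:", cursor.fetchone()[0])
finally:
    connection.close()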


Googlers launch industry-wide awareness campaign to fight against forced arbitration

Natasha Mathur
17 Jan 2019
6 min read
A group of Googlers launched a public awareness social media campaign from 9 AM to 6 PM EST yesterday. The group, called, ‘Googlers for ending forced arbitration’ shared information about arbitration on their Twitter and Instagram accounts throughout the day. https://twitter.com/endforcedarb/status/1084813222505410560 The group tweeted out yesterday, as part of the campaign, that in surveying employees of 30+ tech companies and 10+ common Temp/Contractor suppliers in the industry, none of them could meet the three primary criteria needed for a transparent workplace. The three basic criteria include: optional arbitration policy for all employees and for all forms of discrimination (including contractors/temps), no class action waivers, and no gag rule that keeps arbitration hearings proceedings confidential. The group shared some hard facts about Arbitration and also busted myths regarding the same. Let’s have a look at some of the key highlights from yesterday’s campaign. At least 60 million Americans are forced to use arbitration The group states that the implementation of forced arbitration policy has grown significantly in the past seven years. Over 65% of the companies consisting of 1,000 or more employees, now have mandatory arbitration procedures. Employees don’t have an option to take their employers to court in cases of harassment or discrimination. People of colour and women are often the ones who get affected the most by this practice.           How employers use forced Arbitration Forced arbitration is extremely unfair Arbitration firms that are hired by the companies usually always favour the companies over its employees. This is due to the fear of being rejected the next time by an employer lest the arbitration firm decides to favour the employee. The group states that employees are 1.7 times more likely to win in Federal courts and 2.6 times more likely to win in state courts than in arbitration.   There are no public filings of the complaint details, meaning that the company won’t have anyone to answer to regarding the issues within the organization. The company can also limit its obligation when it comes to disclosing the evidence that you need to prove your case.   Arbitration hearings happen behind closed doors within a company When it comes to arbitration hearings, it's just an employee and their lawyer, other party and their lawyer, along with a panel of one to three arbitrators. Each party gets to pick one arbitrator each, who is also hired by your employers. However, there’s usually only a single arbitrator panel involved as three-arbitrator panel costs five times more than a single arbitrator panel, as per the American Arbitration Association. Forced Arbitration requires employees to sign away their right to class action lawsuits at the start of the employment itself The group states that irrespective of having legal disputes or not, forced arbitration bans employees from coming together as a group in case of arbitration as well as in case of class action lawsuits. Most employers also practice “gag rule” which restricts the employee to even talk about their experience with the arbitration policy. There are certain companies that do give you an option to opt out of forced arbitration using an opt-out form but comes with a time constraint depending on your agreement with that company. For instance, companies such as Twitter, Facebook, and Adecco give their employees a chance to opt out of forced arbitration.                                                  
Arbitration opt-out option JAMS and AAA are among the top arbitration organizations used by major tech giants JAMS, Judicial Arbitration and Mediation Services, is a private company that is used by employers like Google, Airbnb, Uber, Tesla, and VMware. JAMS does not publicly disclose the diversity of its arbitrators. Similarly, AAA, America Arbitration Association, is a non-profit organization where usually retired judges or lawyers serve as arbitrators. Arbitrators in AAA have an overall composition of 24% women and minorities. AAA is one of the largest arbitration organizations used by companies such as Facebook, Lyft, Oracle, Samsung, and Two Sigma.   Katherine Stone, a professor from UCLA law school, states that the procedure followed by these arbitration firms don’t allow much discovery. What this means is that these firms don’t usually permit depositions or various kinds of document exchange before the hearing. “So, the worker goes into the hearing...armed with nothing, other than their own individual grievances, their own individual complaints, and their own individual experience. They can’t learn about the experience of others,” says Stone. Female workers and African-American workers are most likely to suffer from forced arbitration 58% female workers and 59% African American workers face mandatory arbitration depending on the workgroups. For instance, in the construction industry, which is a highly male-dominated industry, the imposition of forced arbitration is at the lowest rate. But, in the education and health industries, which has the majority of the female workforce, the imposition rate of forced arbitration is high.                                 Forced Arbitration rate among different workgroups Supreme Court has gradually allowed companies to expand arbitration to employees & consumers The group states that the 1925 Federal Arbitration Act (FAA) had legalized arbitration between shipping companies in cases of settling commercial disputes. The supreme court, however, expanded this practice of arbitration to companies too.                                                   Supreme court decisions Apart from sharing these facts, the group also shed insight on dos and don’t that employees should follow under forced arbitration clauses.                                                      Dos and Dont’s The social media campaign by Googlers for forced arbitration represents an upsurge in the strength and courage among the employees within the tech industry as not just the Google employees but also employees from different tech companies shared their experience regarding forced arbitration. The group had researched academic institutions, labour attorneys, advocacy groups, etc, and the contracts of around 30 major tech companies, as a part of the campaign. To follow all the highlights from the campaign, follow the End Forced Arbitration Twitter account. Shareholders sue Alphabet’s board members for protecting senior execs accused of sexual harassment Recode Decode #GoogleWalkout interview shows why data and evidence don’t always lead to right decisions in even the world’s most data-driven company Tech Workers Coalition volunteers talk unionization and solidarity in Silicon Valley


4 must-know levels in MongoDB security

Amey Varangaonkar
01 Mar 2018
8 min read
[box type="note" align="" class="" width=""]The following excerpt is taken from the book Mastering MongoDB 3.x written by Alex Giamas. It presents the techniques and essential concepts needed to tackle even the trickiest problems when it comes to working and administering your MongoDB instance.[/box] Security is a multifaceted goal in a MongoDB cluster. In this article, we will examine different attack vectors and how we can protect MongoDB against them. 1. Authentication in MongoDB Authentication refers to verifying the identity of a client. This prevents impersonating someone else in order to gain access to our data. The simplest way to authenticate is using a username/password pair. This can be done via the shell in two ways: > db.auth( <username>, <password> ) Passing in a comma separated username and password will assume default values for the rest of the fields: > db.auth( { user: <username>, pwd: <password>, mechanism: <authentication mechanism>, digestPassword: <boolean> } ) If we pass a document object we can define more parameters than username/password. The (authentication) mechanism parameter can take several different values with the default being SCRAM-SHA-1. The parameter value MONGODB-CR is used for backwards compatibility with versions earlier than 3.0 MONGODB-X509 is used for TLS/SSL authentication. Users and internal replica set servers can be authenticated using SSL certificates, which are self-generated and signed, or come from a trusted third-party authority. This for the configuration file: security.clusterAuthMode / net.ssl.clusterFile Or like this on the command line: --clusterAuthMode and --sslClusterFile > mongod --replSet <name> --sslMode requireSSL --clusterAuthMode x509 --sslClusterFile <path to membership certificate and key PEM file> --sslPEMKeyFile <path to SSL certificate and key PEM file> --sslCAFile <path to root CA PEM file> MongoDB Enterprise Edition, the paid offering from MongoDB Inc., adds two more options for authentication. The first added option is GSSAPI (Kerberos). Kerberos is a mature and robust authentication system that can be used, among others, for Windows based Active Directory Deployments. The second added option is PLAIN (LDAP SASL). LDAP is just like Kerberos; a mature and robust authentication mechanism. The main consideration when using PLAIN authentication mechanism is that credentials are transmitted in plaintext over the wire. This means that we should secure the path between client and server via VPN or a TSL/SSL connection to avoid a man in the middle stealing our credentials. 2. Authorization in MongoDB After we have configured authentication to verify that users are who they claim they are when connecting to our MongoDB server, we need to configure the rights that each one of them will have in our database. This is the authorization aspect of permissions. MongoDB uses role-based access control to control permissions for different user classes. Every role has permissions to perform some actions on a resource. A resource can be a collection or a database or any collections or any databases. The command's format is: { db: <database>, collection: <collection> } If we specify "" (empty string) for either db or collection it means any db or collection. For example: { db: "mongo_books", collection: "" } This would apply our action in every collection in database mongo_books. 
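To make the resource document format above concrete, the following hedged sketch creates a custom role whose only privilege is the find action on every collection of the mongo_books database. It uses the PyMongo driver, which is not part of this excerpt, and assumes we are connected as a user that is allowed to manage roles:

from pymongo import MongoClient

# Assumptions: a local mongod and an existing administrative user to
# authenticate as; adjust the URI to your deployment.
client = MongoClient("mongodb://admin:secret@localhost:27017/admin")
db = client["mongo_books"]

# createRole is the server command behind role creation; the privilege's
# resource document uses exactly the { db, collection } format shown above.
db.command(
    "createRole",
    "booksReader",
    privileges=[{
        "resource": {"db": "mongo_books", "collection": ""},
        "actions": ["find"],
    }],
    roles=[],
)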
Similar to the preceding, we can define: { db: "", collection: "" } We define this to apply our rule to all collections across all databases, except system collections of course. We can also apply rules across an entire cluster as follows: { resource: { cluster : true }, actions: [ "addShard" ] } The preceding example grants privileges for the addShard action (adding a new shard to our system) across the entire cluster. The cluster resource can only be used for actions that affect the entire cluster rather than a collection or database, as for example shutdown, replSetReconfig, appendOplogNote, resync, closeAllDatabases, and addShard. What follows is an extensive list of cluster specific actions and some of the most widely used actions. The list of most widely used actions are: find insert remove update bypassDocumentValidation viewRole / viewUser createRole / dropRole createUser / dropUser inprog killop replSetGetConfig / replSetConfigure / replSetStateChange / resync getShardMap / getShardVersion / listShards / moveChunk / removeShard / addShard dropDatabase / dropIndex / fsync / repairDatabase / shutDown serverStatus / top / validate Cluster-specific actions are: unlock authSchemaUpgrade cleanupOrphaned cpuProfiler inprog invalidateUserCache killop appendOplogNote replSetConfigure replSetGetConfig replSetGetStatus replSetHeartbeat replSetStateChange resync addShard flushRouterConfig getShardMap listShards removeShard shardingState applicationMessage closeAllDatabases connPoolSync fsync getParameter hostInfo logRotate setParameter shutdown touch connPoolStats cursorInfo diagLogging getCmdLineOpts getLog listDatabases netstat serverStatus top If this sounds too complicated that is because it is. The flexibility that MongoDB allows in configuring different actions on resources means that we need to study and understand the extensive lists as described previously. Thankfully, some of the most common actions and resources are bundled in built-in roles. We can use the built-in roles to establish the baseline of permissions that we will give to our users and then fine grain these based on the extensive list. User roles in MongoDB There are two different generic user roles that we can specify: read: A read-only role across non-system collections and the following system collections: system.indexes, system.js, and system.namespaces collections readWrite: A read and modify role across non-system collections and the system.js collection Database administration roles in MongoDB There are three database specific administration roles shown as follows: dbAdmin: The basic admin user role which can perform schema-related tasks, indexing, gathering statistics. A dbAdmin cannot perform user and role management. userAdmin: Create and modify roles and users. This is complementary to the dbAdmin role. dbOwner: Combining readWrite, dbAdmin, and userAdmin roles, this is the most powerful admin user role. Cluster administration roles in MongoDB These are the cluster wide administration roles available: hostManager: Monitor and manage servers in a cluster. clusterManager: Provides management and monitoring actions on the cluster. A user with this role can access the config and local databases, which are used in sharding and replication, respectively. clusterMonitor: Read-only access for monitoring tools provided by MongoDB such as MongoDB Cloud Manager and Ops Manager agent. clusterAdmin: Provides the greatest cluster-management access. 
This role combines the privileges granted by the clusterManager, clusterMonitor, and hostManager roles. Additionally, the role provides the dropDatabase action. Backup restore roles Role-based authorization roles can be defined in the backup restore granularity level as Well: backup: Provides privileges needed to back-up data. This role provides sufficient privileges to use the MongoDB Cloud Manager backup agent, Ops Manager backup agent, or to use mongodump. restore: Provides privileges needed to restore data with mongorestore without the --oplogReplay option or without system.profile collection data. Roles across all databases Similarly, here are the set of available roles across all databases: readAnyDatabase: Provides the same read-only permissions as read, except it applies to all but the local and config databases in the cluster. The role also provides the listDatabases action on the cluster as a whole. readWriteAnyDatabase: Provides the same read and write permissions as readWrite, except it applies to all but the local and config databases in the cluster. The role also provides the listDatabases action on the cluster as a whole. userAdminAnyDatabase: Provides the same access to user administration operations as userAdmin, except it applies to all but the local and config databases in the cluster. Since the userAdminAnyDatabase role allows users to grant any privilege to any user, including themselves, the role also indirectly provides superuser access. dbAdminAnyDatabase: Provides the same access to database administration operations as dbAdmin, except it applies to all but the local and config databases in the cluster. The role also provides the listDatabases action on the cluster as a whole. Superuser Finally, these are the superuser roles available: root: Provides access to the operations and all the resources of the readWriteAnyDatabase, dbAdminAnyDatabase, userAdminAnyDatabase, clusterAdmin, restore, and backup combined. __internal: Similar to root user, any __internal user can perform any action against any object across the server. 3. Network level security Apart from MongoDB specific security measures, there are best practices established for network level security: Only allow communication between servers and only open the ports that are used for communicating between them. Always use TLS/SSL for communication between servers. This prevents man-inthe- middle attacks impersonating a client. Always use different sets of development, staging, and production environments and security credentials. Ideally, create different accounts for each environment and enable two-factor authentication in both staging and production environments. 4. Auditing security No matter how much we plan our security measures, a second or third pair of eyes from someone outside our organization can give a different view of our security measures and uncover problems that we may not have thought of or underestimated. Don't hesitate to involve security experts / white hat hackers to do penetration testing in your servers. Special cases Medical or financial applications require added levels of security for data privacy reasons. If we are building an application in the healthcare space, accessing users' personal identifiable information, we may need to get HIPAA certified. If we are building an application interacting with payments and managing cardholder information, we may need to become PCI/DSS compliant. 
The specifics of each certification are outside the scope of this book but it is important to know that MongoDB has use cases in these fields that fulfill the requirements and as such it can be the right tool with proper design beforehand. To sum up, in addition to the best practices listed above, developers and administrators must always use common sense so that security interferes only as much as needed with operational goals. If you found our article useful, make sure to check out this book Mastering MongoDB 3.x to master other MongoDB administration-related techniques and become a true MongoDB expert.  
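Putting the authentication and network-level advice together from a client's point of view, the hedged sketch below connects with SCRAM-SHA-1 credentials over TLS using PyMongo. PyMongo itself, the host name, the credentials, and the certificate path are all assumptions rather than something prescribed by the article:

from pymongo import MongoClient

# Placeholder connection details; the point is that the client
# authenticates (SCRAM-SHA-1, as discussed above) and encrypts traffic
# with TLS so credentials never cross the wire in plaintext.
client = MongoClient(
    "mongodb.example.com",
    27017,
    username="appUser",
    password="appPassword",
    authSource="mongo_books",
    authMechanism="SCRAM-SHA-1",
    tls=True,
    tlsCAFile="/etc/ssl/mongodb-ca.pem",
)
print(client["mongo_books"].list_collection_names())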


Google rejects all 13 shareholder proposals at its annual meeting, despite protesting workers

Sugandha Lahoti
20 Jun 2019
11 min read
Yesterday at Google’s annual shareholder meeting, Alphabet, Google’s parent company, faced 13 independent stockholder proposals ranging from sexual harassment and diversity, to the company's policies regarding China and forced arbitration. There was also a proposal to limit Alphabet's power by breaking up the company. However, as expected, every stockholder proposal was voted down after a few minutes of ceremonial voting, despite protesting workers. The company’s co-founders, Larry Page, and Sergey Brin were both no-shows and Google CEO Sundar Pichai didn’t answer any questions. Google has seen a massive escalation in employee activism since 2018. The company faced backlash from its employees over a censored version of its search engine for China, forced arbitration policies, and its mishandling of sexual misconduct. Google Walkout for Real Change organizers Claire Stapleton, and Meredith Whittaker were also retaliated against by their supervisors for speaking out. According to widespread media reports the US Department of Justice is readying to investigate into Google. It has been reported that the probe would examine whether the tech giant broke antitrust law in the operation of its online and advertisement businesses. Source: Google’s annual meeting MOM Equal Shareholder Voting Shareholders requested that Alphabet’s Board initiate and adopt a recapitalization plan for all outstanding stock to have one vote per share. Currently, the company has a multi-class voting structure, where each share of Class A common stock has one vote and each share of Class B common stock has 10 votes. As a result, Page and Brin currently control over 51% of the company’s total voting power, while owning less than 13% of stock. This raises concerns that the interests of public shareholders may be subordinated to those of our co-founders. However, board of directors rejected the proposal stating that their capital and governance structure has provided significant stability to the company, and is therefore in the best interests of the stockholders. Commit to not use Inequitable Employment Practices This proposal urged Alphabet to commit not to use any of the Inequitable Employment Practices, to encourage focus on human capital management and improve accountability.  “Inequitable Employment Practices” are mandatory arbitration of employment-related claims, non-compete agreements with employees, agreements with other companies not to recruit one another’s employees, and involuntary non-disclosure agreements (“NDAs”) that employees are required to sign in connection with settlement of claims that any Alphabet employee engaged in unlawful discrimination or harassment. Again Google rejected this proposal stating that the updated code of conduct already covers these requirements and do not believe that implementing this proposal would provide any incremental value or benefit to its stockholders or employees. Establishment of a Societal Risk Oversight Committee Stockholders asked Alphabet to establish a Societal Risk Oversight Committee (“the Committee”) of the Board of Directors, composed of independent directors with relevant experience. The Committee should provide an ongoing review of corporate policies and procedures, above and beyond legal and regulatory matters, to assess the potential societal consequences of the Company’s products and services, and should offer guidance on strategic decisions. 
As with the other Board committees, a formal charter for the Committee and a summary of its functions should be made publicly available. This proposal was also rejected. Alphabet said, “The current structure of our Board of Directors and its committees (Audit, Leadership Development and Compensation, and Nominating and Corporate Governance) already covers these issues.” Report on Sexual Harassment Risk Management Shareholders requested management to review its policies related to sexual harassment to assess whether the company needs to adopt and implement additional policies and to report its findings, omitting proprietary information and prepared at a reasonable expense by December 31, 2019. However, this was rejected as well based on the conclusion that Google has robust Conduct Policies already in place. There is also ongoing improvement and reporting in this area, and Google has a concrete plan of action to do more. Majority Vote for the Election of Directors This proposal demands that director nominees should be elected by the affirmative vote of the majority of votes cast at an annual meeting of shareholders, with a plurality vote standard retained for contested director elections, i.e., when the number of director nominees exceeds the number of board seats. This proposal also demands that a director who receives less than such a majority vote should be removed from the board immediately or as soon as a replacement director can be qualified on an expedited basis. This proposal was rejected basis, “Our Board of Directors believes that current nominating and voting procedures for election to our Board of Directors, as opposed to a mandated majority voting standard, provide the board the flexibility to appropriately respond to stockholder interests without the risk of potential corporate governance complications arising from failed elections.” Report on Gender Pay Stockholders demand that Google should report on the company’s global median gender pay gap, including associated policy, reputational, competitive, and operational risks, and risks related to recruiting and retaining female talent. The report should be prepared at a reasonable cost, omitting proprietary information, litigation strategy, and legal compliance information. Google says it already releases an annual report on its pay equity analyses, “ensuring they pay fairly and equitably.” They don’t think an additional report as detailed in the proposal above would enhance Alphabet’s existing commitment to fostering a fair and inclusive culture. Strategic Alternatives The proposal outlines the need for an orderly process to retain advisors to study strategic alternatives. They believe Alphabet may be too large and complex to be managed effectively and want a voluntary strategic reduction in the size of the company than from asset sales compelled by regulators. They also want a committee of independent directors to evaluate those alternatives in the exercise of their fiduciary responsibilities to maximize shareholder value. This proposal was also rejected basis, “Our Board of Directors and management do not favor a given size of the company or focus on any strategy based on ideological grounds. 
Instead, we develop a strategy based on the company’s customers, partners, users and the communities we serve, and focus on strategies that maximize long-term sustainable stockholder value.” Nomination of an Employee Representative Director An Employee Representative Director should be nominated for election to the Board by shareholders at Alphabet’s 2020 annual meeting of shareholders. Stockholders say that employee representation on Alphabet’s Board would add knowledge and insight on issues critical to the success of the Company, beyond that currently present on the Board, and may result in more informed decision making. Alphabet quoted their “Consideration of Director Nominees” section of the proxy statement, stating that the Nominating and Corporate Governance Committee of Alphabet’s Board of Directors look for several critical qualities in screening and evaluating potential director candidates to serve our stockholders. They only allow the best and most qualified candidates to be elected to the Board of Directors. Accordingly, this proposal was also rejected. Simple Majority Vote The shareholders call for the simple majority vote to be eliminated and replaced by a requirement for a majority of the votes cast for and against applicable proposals, or a simple majority in compliance with applicable laws. Currently, the voice of regular shareholders is diminished because certain insider shares have 10-times as many votes per share as regular shares. Plus shareholders have no right to act by written consent. The rejection reason laid out by Google was that more stringent voting requirements in certain limited circumstances are appropriate and in the best interests of Google’s stockholders and the company. Integrating Sustainability Metrics Report into Performance measures Shareholders requested the Board Compensation Committee to prepare a report assessing the feasibility of integrating sustainability metrics into performance measures. These should apply to senior executives under the Company’s compensation plans or arrangements. They state that Alphabet remains predominantly white, male, and occupationally segregated. Among Alphabet’s top 290 managers in 2017, just over one-quarter were women and only 17 managers were underrepresented people of color. Even after proof of Alphabet being not inclusive, the proposal was rejected. Alphabet said that the company already supports corporate sustainability, including environmental, social and diversity considerations. Google Search in China Google employees, as well as Human rights organizations, have called on Google to end work on Dragonfly. Google employees have quit to avoid working on products that enable censorship; 1,400 current employees have signed a letter protesting Dragonfly. Employees said: “Currently we do not have the information required to make ethically-informed decisions about our work, our projects, and our employment.” Some employees have threatened to strike. Dragonfly may also be inconsistent with Google’s AI Principles. The proposal urged Google to publish a Human Rights Impact Assessment by no later than October 30, 2019, examining the actual and potential impacts of censored Google search in China. Google rejected this proposal stating that before Google launches any new search product, it would conduct proper human rights due diligence, confer with Global Network Initiative (GNI) partners and other key stakeholders. 
If Google ever considers re-engaging this work, it will do so transparently, engaging and consulting widely. Clawback Policy Alphabet currently does not disclose an incentive compensation clawback policy in its proxy statement. The proposal urges Alphabet’s Leadership Development and Compensation Committee to adopt a clawback policy. This committee will review the incentive compensation paid, granted or awarded to a senior executive in case there is misconduct resulting in a material violation of law or Alphabet’s policy that causes significant financial or reputational harm to Alphabet. Alphabet rejected this proposal based on the following claims: The company has announced significant revisions to its workplace culture policies. Google has ended forced arbitration for employment claims. Any violation of our Code of Conduct or other policies may result in disciplinary action, including termination of employment and forfeiture of unvested equity awards. Report on Content Governance Stockholders are concerned that Alphabet’s Google is failing to effectively address content governance concerns, posing risks to shareholder value. They request Alphabet to issue a report to shareholders assessing the risks posed by content management controversies related to election interference, freedom of expression, and the spread of hate speech. Google said that they have already done significant work in the areas of combatting violent or extremist content, misinformation, and election interference and have continued to keep the public informed about our efforts. Thus they do not believe that implementing this proposal would benefit our stockholders. Read the full report here. Protestors showed disappointment As the annual meeting was conducted, thousands of activists protested across Google offices. A UK-based human rights organization called SumOfUs teamed up with Students for a Free Tibet to propose a breakup of Google. They organized a protest outside of 12 Google offices around the world to coincide with the shareholder meeting, including in San Francisco, Stockholm, and Mumbai. https://twitter.com/SumOfUs/status/1141363780909174784 Alphabet Board Chairman, John Hennessy opened the meeting by reflecting on Google's mission to provide information to the world. "Of course this comes with a deep and growing responsibility to ensure the technology we create benefits society as a whole," he said. "We are committed to supporting our users, our employees and our shareholders by always acting responsibly, inclusively and fairly." In the Q&A session which followed the meeting, protestors demanded why Page wasn't in attendance. "It's a glaring omission," one of the protestors said. "I think that's disgraceful." Hennessy responded, "Unfortunately, Larry wasn't able to be here" but noted Page has been at every board meeting. The stockholders pushed back, saying the annual meeting is the only opportunity for investors to address him. Toward the end of the meeting, protestors including Google employees, low-paid contract workers, and local community members were shouting outside. “We shouldn’t have to be here protesting. 
We should have been included.” One sign from the protest, listing the first names of the members of Alphabet’s board, read, “Sundar, Larry, Sergey, Ruth, Kent — You don’t speak for us!” https://twitter.com/GoogleWalkout/status/1141454091455008769 https://twitter.com/LauraEWeiss16/status/1141400193390141440   This meeting also marked the official departure of former Google CEO Eric Schmidt and former Google Cloud chief Diane Greene from Alphabet's board. Earlier this year, they said they wouldn’t be seeking re-election. Google Walkout organizer, Claire Stapleton resigns after facing retaliation from management US regulators plan to probe Google on anti-trust issues; Facebook, Amazon & Apple also under legal scrutiny. Google employees lay down actionable demands after staging a sit-in to protest retaliation

NeurIPS 2018: How machine learning experts can work with policymakers to make good tech decisions [Invited Talk]

Bhagyashree R
18 Dec 2018
6 min read
At the 32nd annual  NeurIPS conference held earlier this month, Edward William Felten, a professor of computer science and public affairs at Princeton University spoke about how decision makers and tech experts can work together to make better policies. The talk was aimed at answering questions such as why should public policy matter to AI researchers, what role can researchers play in policy debates, and how can researchers help bridge divides between the research and policy communities. While AI and machine learning are being used in high impact areas and have seen heavy adoption in every field, in recent years, they have also gained a lot of attention from the policymakers. Technology has become a huge topic of discussion among policymakers mainly because of its cases of failure and how it is being used or misused. They have now started formulating laws and regulations and holding discussions about how society will govern the development of these technologies. Prof. Felten explained how having constructive engagement with policymakers will lead to better outcomes for technology, government, and society. Why tech should be regulated? Regulating tech is important, and for that researchers, data scientists, and other people in tech fields have to close the gap between their research labs, cubicles, and society. Prof. Felten emphasizes that it is up to the tech people to bridge this gap as we not only have the opportunity but also a duty to be more active and productive in participating in public life. There are many people coming to the conclusion that tech should be regulated before it is too late. In a piece published by the Wall Street Journal, three experts debated about whether the government should regulate AI. One of them, Ryan Calo explains, “One of the ironies of artificial intelligence is that proponents often make two contradictory claims. They say AI is going to change everything, but there should be no changes to the law or legal institutions in response.” Prof. Felten points out that law and policies are meant to change in order to adapt according to the current conditions. They are not just written once and for all for the cases of today and the future, rather law is a living system that adapts to what is going on in the society. And, if we believe that technology is going to change everything, we can expect that law will change. Prof. Felten also said that not only the tech researchers and policymakers but the society also should also have some say in how the technology is developed, “After all the people who are affected by the change that we are going to cause deserve some say in how that change happens, how it is used. If we believe in a society which is fundamentally democratic in which everyone has a stake and everyone has a voice then it is only fair that those lives we are going to change have some say in how that change come about and what kind of changes are going to happen and which are not.” How experts can work with decision makers to make good tech decisions The three key approaches that we can take to engage with policymakers to take a decision about technology: Engage in a two-way dialogue with policymakers As a researcher, we might think that we are tech experts/scientists and we do not need to get involved in politics. We need to just share the facts we know and our job is done. But if researchers really want to maximize their impact in policy debates, they need to combine the knowledge and preferences of policymakers with their knowledge and preferences. 
Which means, they need to take into account what policymakers might already have heard about a particular subject and the issues or approaches that resonate with them. Prof. Felten explains that this type of understanding and exchange of ideas can be done in two stages. Researchers need to ask several questions to policymakers, which is not a one-time thing, rather a multi-round protocol. They have to go back and forth with the person and need to build engagement over time and mutual trust. And, then they need to put themselves into the shoes of a decision maker and understand how to structure the decision space for them. Be present in the room when the decisions are being made To have their influence on the decisions that get made, researchers need to have “boots on the ground.” Though not everyone has to engage in this deep and long-term process of decision making, we need some people from the community to engage on behalf of the community. Researchers need to be present in the room when the decisions are being made. This means taking posts as advisers or civil servants. We already have a range of such posts at both local and national government levels, alongside a range of opportunities to engage less formally in policy development and consultations. Creating a career path and rewarding policy engagement To drive this engagement, we need to create a career path which rewards policy engagement. We should have a way through which researchers can move between policy and research careers. Prof. Felten pointed to a range of US-based initiatives that seek to bring those with technical expertise into policy-oriented roles, such as the US Digital Service. He adds that if we do not create these career paths and if this becomes something that people can do only after sacrificing their careers then very few people will do it. This needs to be an activity that we learn to respect when people in the community do it well. We need to build incentives whether it is in career incentives in academia, whether it is understanding that working in government or on policy issues is a valuable part of one kind of academic career and not thinking of it as deter or a stop. To watch the full talk, check out NeurIPS Facebook page. NeurIPS 2018: Rethinking transparency and accountability in machine learning NeurIPS 2018: Developments in machine learning through the lens of Counterfactual Inference [Tutorial] Accountability and algorithmic bias: Why diversity and inclusion matters [NeurIPS Invited Talk]


Implementing feedforward networks with TensorFlow

Aarthi Kumaraswamy
07 Jun 2018
12 min read
Deep feedforward networks, also called feedforward neural networks, are sometimes also referred to as Multilayer Perceptrons (MLPs). The goal of a feedforward network is to approximate the function of f∗. For example, for a classifier, y=f∗(x) maps an input x to a label y. A feedforward network defines a mapping from input to label y=f(x;θ). It learns the value of the parameter θ that results in the best function approximation. This tutorial is an excerpt from the book, Neural Network Programming with Tensorflow by Manpreet Singh Ghotra, and Rajdeep Dua. With this book, learn how to implement more advanced neural networks like CCNs, RNNs, GANs, deep belief networks and others in Tensorflow. How do feedforward networks work? Feedforward networks are a conceptual stepping stone on the path to recurrent networks, which power many natural language applications. Feedforward neural networks are called networks because they compose together many different functions which represent them. These functions are composed in a directed acyclic graph. The model is associated with a directed acyclic graph describing how the functions are composed together. For example, there are three functions f(1), f(2), and f(3) connected to form f(x) =f(3)(f(2)(f(1)(x))). These chain structures are the most commonly used structures of neural networks. In this case, f(1) is called the first layer of the network, f(2) is called the second layer, and so on. The overall length of the chain gives the depth of the model. It is from this terminology that the name deep learning arises. The final layer of a feedforward network is called the output layer. Diagram showing various functions activated on input x to form a neural network These networks are called neural because they are inspired by neuroscience. Each hidden layer is a vector. The dimensionality of these hidden layers determines the width of the model. Implementing feedforward networks with TensorFlow Feedforward networks can be easily implemented using TensorFlow by defining placeholders for hidden layers, computing the activation values, and using them to calculate predictions. Let's take an example of classification with a feedforward network: X = tf.placeholder("float", shape=[None, x_size]) y = tf.placeholder("float", shape=[None, y_size]) weights_1 = initialize_weights((x_size, hidden_size), stddev) weights_2 = initialize_weights((hidden_size, y_size), stddev) sigmoid = tf.nn.sigmoid(tf.matmul(X, weights_1)) y = tf.matmul(sigmoid, weights_2) Once the predicted value tensor has been defined, we calculate the cost function: cost = tf.reduce_mean(tf.nn.OPERATION_NAME(labels=<actual value>, logits=<predicted value>)) updates_sgd = tf.train.GradientDescentOptimizer(sgd_step).minimize(cost) Here, OPERATION_NAME could be one of the following: tf.nn.sigmoid_cross_entropy_with_logits: Calculates sigmoid cross entropy on incoming logits and labels: sigmoid_cross_entropy_with_logits( _sentinel=None, labels=None, logits=None, name=None )Formula implemented is max(x, 0) - x * z + log(1 + exp(-abs(x))) _sentinel: Used to prevent positional parameters. Internal, do not use. labels: A tensor of the same type and shape as logits. logits: A tensor of type float32 or float64. The formula implemented is ( x = logits, z = labels) max(x, 0) - x * z + log(1 + exp(-abs(x))). tf.nn.softmax: Performs softmax activation on the incoming tensor. This only normalizes to make sure all the probabilities in a tensor row add up to one. It cannot be directly used in a classification. 
softmax = exp(logits) / reduce_sum(exp(logits), dim) logits: A non-empty tensor. Must be one of the following types--half, float32, or float64. dim: The dimension softmax will be performed on. The default is -1, which indicates the last dimension. name: A name for the operation (optional). tf.nn.log_softmax: Calculates the log of the softmax function and helps in normalizing underfitting. This function is also just a normalization function. log_softmax( logits, dim=-1, name=None ) logits: A non-empty tensor. Must be one of the following types--half, float32, or float64. dim: The dimension softmax will be performed on. The default is -1, which indicates the last dimension. name: A name for the operation (optional). tf.nn.softmax_cross_entropy_with_logits softmax_cross_entropy_with_logits( _sentinel=None, labels=None, logits=None, dim=-1, name=None ) _sentinel: Used to prevent positional parameters. For internal use only. labels: Each rows labels[i] must be a valid probability distribution. logits: Unscaled log probabilities. dim: The class dimension. Defaulted to -1, which is the last dimension. name: A name for the operation (optional). The preceding code snippet computes softmax cross entropy between logits and labels. While the classes are mutually exclusive, their probabilities need not be. All that is required is that each row of labels is a valid probability distribution. For exclusive labels, use (where one and only one class is true at a time) sparse_softmax_cross_entropy_with_logits. tf.nn.sparse_softmax_cross_entropy_with_logits sparse_softmax_cross_entropy_with_logits( _sentinel=None, labels=None, logits=None, name=None ) labels: Tensor of shape [d_0, d_1, ..., d_(r-1)] (where r is the rank of labels and result) and dtype, int32, or int64. Each entry in labels must be an index in [0, num_classes). Other values will raise an exception when this operation is run on the CPU and return NaN for corresponding loss and gradient rows on the GPU. logits: Unscaled log probabilities of shape [d_0, d_1, ..., d_(r-1), num_classes] and dtype, float32, or float64. The preceding code computes sparse softmax cross entropy between logits and labels. The probability of a given label is considered exclusive. Soft classes are not allowed, and the label's vector must provide a single specific index for the true class for each row of logits. tf.nn.weighted_cross_entropy_with_logits weighted_cross_entropy_with_logits( targets, logits, pos_weight, name=None ) targets: A tensor of the same type and shape as logits. logits: A tensor of type float32 or float64. pos_weight: A coefficient to use on the positive examples. This is similar to sigmoid_cross_entropy_with_logits() except that pos_weight allows a trade-off of recall and precision by up or down-weighting the cost of a positive error relative to a negative error. Analyzing the Iris dataset with a Tensorflow feedforward network Let's look at a feedforward example using the Iris dataset. You can download the dataset from https://github.com/ml-resources/neuralnetwork-programming/blob/ed1/ch02/iris/iris.csv and the target labels from https://github.com/ml-resources/neuralnetwork-programming/blob/ed1/ch02/iris/target.csv. In the Iris dataset, we will use 150 rows of data made up of 50 samples from each of three Iris species: Iris setosa, Iris virginica, and Iris versicolor. Petal geometry compared from three iris species: Iris Setosa, Iris Virginica, and Iris Versicolor. 
In the dataset, each row contains data for each flower sample: sepal length, sepal width, petal length, petal width, and flower species. Flower species are stored as integers, with 0 denoting Iris setosa, 1 denoting Iris versicolor, and 2 denoting Iris virginica. First, we will create a run() function that takes three parameters--hidden layer size h_size, standard deviation for weights stddev, and Step size of Stochastic Gradient Descent sgd_step: def run(h_size, stddev, sgd_step) Input data loading is done using the genfromtxt function in numpy. The Iris data loaded has a shape of L: 150 and W: 4. Data is loaded in the all_X variable. Target labels are loaded from target.csv in all_Y with the shape of L: 150, W:3: def load_iris_data(): from numpy import genfromtxt data = genfromtxt('iris.csv', delimiter=',') target = genfromtxt('target.csv', delimiter=',').astype(int) # Prepend the column of 1s for bias L, W = data.shape all_X = np.ones((L, W + 1)) all_X[:, 1:] = data num_labels = len(np.unique(target)) all_y = np.eye(num_labels)[target] return train_test_split(all_X, all_y, test_size=0.33, random_state=RANDOMSEED) Once data is loaded, we initialize the weights matrix based on x_size, y_size, and h_size with standard deviation passed to the run() method: x_size= 5 y_size= 3 h_size= 128 (or any other number chosen for neurons in the hidden layer) # Size of Layers x_size = train_x.shape[1] # Input nodes: 4 features and 1 bias y_size = train_y.shape[1] # Outcomes (3 iris flowers) # variables X = tf.placeholder("float", shape=[None, x_size]) y = tf.placeholder("float", shape=[None, y_size]) weights_1 = initialize_weights((x_size, h_size), stddev) weights_2 = initialize_weights((h_size, y_size), stddev) Next, we make the prediction using sigmoid as the activation function defined in the forward_propagration() function: def forward_propagation(X, weights_1, weights_2): sigmoid = tf.nn.sigmoid(tf.matmul(X, weights_1)) y = tf.matmul(sigmoid, weights_2) return y First, sigmoid output is calculated from input X and weights_1. This is then used to calculate y as a matrix multiplication of sigmoid and weights_2: y_pred = forward_propagation(X, weights_1, weights_2) predict = tf.argmax(y_pred, dimension=1) Next, we define the cost function and optimization using gradient descent. Let's look at the GradientDescentOptimizer being used. It is defined in the tf.train.GradientDescentOptimizer class and implements the gradient descent algorithm. To construct an instance, we use the following constructor and pass sgd_step as a parameter: # constructor for GradientDescentOptimizer __init__( learning_rate, use_locking=False, name='GradientDescent' ) Arguments passed are explained here: learning_rate: A tensor or a floating point value. The learning rate to use. use_locking: If True, use locks for update operations. name: Optional name prefix for the operations created when applying gradients. The default name is "GradientDescent". The following list shows the code to implement the cost function: cost = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(labels=y, logits=y_pred)) updates_sgd = tf.train.GradientDescentOptimizer(sgd_step).minimize(cost) Next, we will implement the following steps: Initialize the TensorFlow session: sess = tf.Session() Initialize all the variables using tf.initialize_all_variables(); the return object is used to instantiate the session. Iterate over steps (1 to 50). For each step in train_x and train_y, execute updates_sgd. Calculate the train_accuracy and test_accuracy. 
We stored the accuracy for each step in a list so that we could plot a graph:

init = tf.initialize_all_variables()
steps = 50
sess.run(init)
x = np.arange(steps)
test_acc = []
train_acc = []
print("Step, train accuracy, test accuracy")

for step in range(steps):
    # Train with each example
    for i in range(len(train_x)):
        sess.run(updates_sgd, feed_dict={X: train_x[i: i + 1],
                                         y: train_y[i: i + 1]})

    train_accuracy = np.mean(np.argmax(train_y, axis=1) ==
                             sess.run(predict, feed_dict={X: train_x, y: train_y}))
    test_accuracy = np.mean(np.argmax(test_y, axis=1) ==
                            sess.run(predict, feed_dict={X: test_x, y: test_y}))

    print("%d, %.2f%%, %.2f%%"
          % (step + 1, 100. * train_accuracy, 100. * test_accuracy))
    test_acc.append(100. * test_accuracy)
    train_acc.append(100. * train_accuracy)

Code execution

Let's run this code for an h_size of 128, a standard deviation of 0.1, and an sgd_step of 0.01:

def run(h_size, stddev, sgd_step):
    ...

def main():
    run(128, 0.1, 0.01)

if __name__ == '__main__':
    main()

The preceding code outputs a graph that plots the steps versus the test and train accuracy (a minimal plotting sketch is given at the end of this article).

Let's compare different SGD step sizes and their effect on training accuracy. The following code is very similar to the previous example, but we rerun it for multiple SGD step values to see how the step size affects accuracy levels:

def run(h_size, stddev, sgd_steps):
    ....
    test_accs = []
    train_accs = []
    time_taken_summary = []
    for sgd_step in sgd_steps:
        start_time = time.time()   # requires the time module to be imported
        updates_sgd = tf.train.GradientDescentOptimizer(sgd_step).minimize(cost)
        sess = tf.Session()
        init = tf.initialize_all_variables()
        steps = 50
        sess.run(init)
        x = np.arange(steps)
        test_acc = []
        train_acc = []
        print("Step, train accuracy, test accuracy")

        for step in range(steps):
            # Train with each example
            for i in range(len(train_x)):
                sess.run(updates_sgd, feed_dict={X: train_x[i: i + 1],
                                                 y: train_y[i: i + 1]})

            train_accuracy = np.mean(np.argmax(train_y, axis=1) ==
                                     sess.run(predict, feed_dict={X: train_x, y: train_y}))
            test_accuracy = np.mean(np.argmax(test_y, axis=1) ==
                                    sess.run(predict, feed_dict={X: test_x, y: test_y}))

            print("%d, %.2f%%, %.2f%%"
                  % (step + 1, 100. * train_accuracy, 100. * test_accuracy))
            #x.append(step)
            test_acc.append(100. * test_accuracy)
            train_acc.append(100. * train_accuracy)
        end_time = time.time()
        diff = end_time - start_time
        time_taken_summary.append((sgd_step, diff))
        t = [np.array(test_acc)]
        t.append(train_acc)
        train_accs.append(train_acc)

The output of the preceding code is an array with the training and test accuracy for each SGD step value. In our example, we called the run() function with sgd_steps set to [0.01, 0.02, 0.03]:

def main():
    sgd_steps = [0.01, 0.02, 0.03]
    run(128, 0.1, sgd_steps)

if __name__ == '__main__':
    main()

The resulting plot shows how training accuracy changes with sgd_steps. For an SGD step value of 0.03, it reaches a higher accuracy faster, as the step size is larger.

In this post, we built our first neural network, which was feedforward only, and used it for classifying the contents of the Iris dataset.

You enjoyed a tutorial from the book, Neural Network Programming with TensorFlow. To implement advanced neural networks like CNNs, RNNs, GANs, deep belief networks, and others in TensorFlow, grab your copy today!

Neural Network Architectures 101: Understanding Perceptrons
How to Implement a Neural Network with Single-Layer Perceptron
Deep Learning Algorithms: How to classify Irises using multi-layer perceptrons
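As a side note, the accuracy graphs referenced above appear as images in the original post and are not reproduced here. The following minimal Matplotlib sketch is not part of the original code; it only shows how the collected accuracy lists could be plotted, and the values below are toy stand-ins for the train_acc and test_acc lists built in the training loop:

import matplotlib.pyplot as plt

# Toy stand-in values for the x, train_acc and test_acc lists from the loop above.
x = list(range(1, 6))
train_acc = [35.0, 60.0, 72.0, 80.0, 85.0]
test_acc = [33.0, 58.0, 70.0, 78.0, 82.0]

plt.plot(x, train_acc, label='train accuracy')
plt.plot(x, test_acc, label='test accuracy')
plt.xlabel('step')
plt.ylabel('accuracy (%)')
plt.legend()
plt.show()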
Pentaho Data Integration 4: Understanding Data Flows

Packt
24 Jun 2011
12 min read
Pentaho Data Integration 4 Cookbook: Over 70 recipes to solve ETL problems using Pentaho Kettle

This article by Adrián Sergio Pulvirenti and María Carina Roldán, authors of Pentaho Data Integration 4 Cookbook, focuses on the different ways of combining, splitting, or manipulating streams or flows of data using Kettle transformations. In this article, we will cover:

Splitting a stream into two or more streams based on a condition
Merging rows from two streams with the same or different structure
Comparing two streams and generating differences
Generating all possible pairs formed from two datasets

Introduction

The main purpose of Kettle transformations is to manipulate data in the form of a dataset; this task is done by the steps of the transformation. When a transformation is launched, all its steps are started. During the execution, the steps work simultaneously, reading rows from the incoming hops, processing them, and delivering them to the outgoing hops. When there are no more rows left, the execution of the transformation ends.

The dataset that flows from step to step is nothing more than a set of rows all having the same structure or metadata. This means that all rows have the same number of columns, and the columns in all rows have the same type and name.

Suppose that you have a single stream of data and that you apply the same transformations to all rows, that is, you have all steps connected in a row one after the other. In other words, you have the simplest of transformations from the point of view of its structure. In this case, you don't have to worry much about the structure of your data stream, nor the origin or destination of the rows. The interesting part comes when you face other situations, for example:

You want a step to start processing rows only after another given step has processed all rows
You have more than one stream and you have to combine them into a single stream
You have to inject rows in the middle of your stream and those rows don't have the same structure as the rows in your dataset

With Kettle, you can actually do all of this, but you have to be careful because it's easy to end up doing the wrong thing and getting unexpected results or, even worse, undesirable errors. The first example does not represent the default behavior, due to the parallel nature of transformations explained earlier. There are, however, two steps that might help:

Blocking Step: This step blocks processing until all incoming rows have been processed.
Block this step until steps finish: This step blocks processing until the selected steps finish.

Both these steps are in the Flow category. This article and the next one, on working with complex data flows, focus on the other two examples and some similar use cases, by explaining the different ways of combining, splitting, or manipulating streams of data.

Splitting a stream into two or more streams based on a condition

In this recipe, you will learn to use the Filter rows step in order to split a single stream into different smaller streams. In the There's more section, you will also see alternative and more efficient ways of doing the same thing in different scenarios.
Let's assume that you have a set of outdoor products in a text file, and you want to differentiate the tents from other kinds of products, and also create a subclassification of the tents depending on their prices. Let's see a sample of this data:

id_product,desc_product,price,category
1,"Swedish Firesteel - Army Model",19,"kitchen"
2,"Mountain House #10 Can Freeze-Dried Food",53,"kitchen"
3,"Lodge Logic L9OG3 Pre-Seasoned 10-1/2-Inch Round Griddle",14,"kitchen"
...

Getting ready

To run this recipe, you will need a text file named outdoorProducts.txt with information about outdoor products. The file contains information about the category and price of each product.

How to do it...

Carry out the following steps:

Create a transformation.
Drag a Text file input step onto the canvas and fill in the File tab to read the file named outdoorProducts.txt. If you are using the sample text file, type , as the Separator.
Under the Fields tab, use the Get Fields button to populate the grid. Adjust the entries so that the grid looks like the one shown in the following screenshot:
Now, let's add the steps to manage the flow of the rows. To do this, drag two Filter rows steps from the Flow category. Also, drag three Dummy steps that will represent the three resulting streams.
Create the hops, as shown in the following screenshot. When you create the hops, make sure that you choose the options according to the image: Result is TRUE for creating a hop with a green icon, and Result is FALSE for creating a hop with a red icon in it.
Double-click on the first Filter rows step and complete the condition, as shown in the following screenshot:
Double-click on the second Filter rows step and complete the condition with price < 100.
You have just split the original dataset into three groups. You can verify it by previewing each Dummy step. The first one has the products whose category is not tents; the second one, the tents under 100 US$; and the last group, the expensive tents, those whose price is over 100 US$. The preview of the last Dummy step will show the following:

How it works...

The main objective in this recipe is to split a dataset of products depending on their category and price. To do this, you used the Filter rows step. In the Filter rows setting window, you tell Kettle where the data flows to depending on the result of evaluating a condition for each row. In order to do that, you have two list boxes: Send 'true' data to step and Send 'false' data to step. The destination steps can be set by using the hop properties, as you did in the recipe. Alternatively, you can set them in the Filter rows setting dialog by selecting the names of the destination steps from the available drop-down lists.

You also have to enter the condition. The condition has the following parts:

The upper textbox on the left is meant to negate the condition.
The left textbox is meant to select the field that will be used for comparison.
Then, you have a list of possible comparators to choose from.
On the right, you have two textboxes: the upper textbox for comparing against a field and the bottom textbox for comparing against a constant value.

Also, you can include more conditions by clicking on the Add Condition button on the right. If you right-click on a condition, a contextual menu appears to let you delete, edit, or move it.

In the first Filter rows step of the recipe, you typed a simple condition: you compared a field (category) with a fixed value (tents) by using the equal (=) operator.
You did this to separate the tent products from the others. The second filter had the purpose of differentiating the expensive and the cheap tents.

There's more...

You will find more filter features in the following subsections.

Avoiding the use of Dummy steps

In the recipe, we assumed that you wanted all three groups of products for further processing. Now, suppose that you only want the cheap tents and you don't care about the rest. You could use just one Filter rows step with the condition category = tents AND price < 100, and send the 'false' data to a Dummy step, as shown in the following diagram:

The rows that don't meet the condition will end up at the Dummy step. Although this is a very commonly used solution for keeping just the rows that meet the conditions, there is a simpler way to implement it. When you create the hop from the Filter rows step toward the next step, you are asked for the kind of hop. If you choose Main output of step, the two options Send 'true' data to step and Send 'false' data to step will remain empty. This will cause two things: only the rows that meet the condition will pass, and the rest will be discarded.

Comparing against the value of a Kettle variable

The recipe above shows you how to configure the condition in the Filter rows step to compare a field against another field or a constant value, but what if you want to compare against the value of a Kettle variable? Let's assume, for example, that you have a named parameter called categ with kitchen as its Default Value. As you might know, named parameters are a particular kind of Kettle variable. You create named parameters under the Parameter tab of the Settings option in the Edit menu.

To use this variable in a condition, you must add it to your dataset in advance. You do this as follows: add a Get Variables step from the Job category. Put it in the stream after the Text file input step and before the Filter rows step; use it to create a new field named categ of String type, with the value ${categ} in the Variable column. Now, the transformation looks like the one shown in the following screenshot:

After this, you can set the condition of the first Filter rows step to category = categ, selecting categ from the listbox of fields on the right. This way, you will be filtering the kitchen products. If you run the transformation and set the parameter to tents, you will obtain results similar to those obtained in the main recipe.

Avoiding the use of nested Filter rows steps

Suppose that you want to compare a single field against a discrete and short list of possible values and do different things for each value in that list. In this case, you can use the Switch / Case step instead of nested Filter rows steps. Let's assume that you have to send the rows to different steps depending on the category. The best way to do this is with the Switch / Case step; this way you avoid adding one Filter rows step for each category. In this step, you have to select the field to be used for comparing. You do it in the Field name to switch listbox. In the Case values grid, you set the Value and Target step pairs.
The following screenshot shows how to fill in the grid for our particular problem:

The following are some considerations about this step:

You can have multiple values directed to the same target step
You can leave the value column blank to specify a target step for empty values
You have a listbox named Default target step to specify the target step for rows that do not match any of the case values
You can only compare with an equal operator

If you want to compare against a substring of the field, you can enable the Use string contains option and, as the Case Value, type the substring you are interested in. For example, if for Case Value you type tent_, then all categories containing tent_, such as tent_large, tent_small, or best_tents, will be redirected to the same target step.

Overcoming the difficulties of complex conditions

There will be situations where the condition is too complex to be expressed in a single Filter rows step. You can nest Filter rows steps and create temporary fields in order to solve the problem, but it would be more efficient to use the Java Filter or User Defined Java Expression step, as explained next.

You can find the Java Filter step in the Flow category. The difference compared to the Filter rows step is that in this step you write the condition using a Java expression. The listboxes are named Destination step for matching rows (optional) and Destination step for non-matching rows (optional); the names differ from those in the Filter rows step, but their purpose is the same. As an example, the two conditions used in the recipe, rewritten as Java expressions, are category.equals("tents") and price < 100. These are extremely simple, but you can write any Java expression as long as it evaluates to a Boolean result.

If you can't guarantee that the category will not be null, you'd better invert the first expression and put "tents".equals(category) instead. By doing this whenever you have to check whether a field is equal to a constant, you avoid an unexpected Java error.

Finally, suppose that you have to split the streams simply to set some fields and then join the streams again. For example, assume that you want to change the category as follows: tents become cheap_tents or expensive_tents depending on their price, while the other categories stay the same. Doing this with nested Filter rows steps leads to a transformation like the following:

You can do the same thing in a simpler way: replace all the steps except the Text file input with a User Defined Java Expression step, located in the Scripting category. In the setting window of this step, add a row in order to replace the value of the category field: as New field and Replace value, type category; as Value type, select String; and as Java expression, type the following:

(category.equals("tents"))?(price<100?"cheap_tents":"expensive_tents"):category

The preceding expression uses the Java ternary operator ?:. If you're not familiar with the syntax, think of it as shorthand for an if-then-else statement. For example, the inner expression price<100?"cheap_tents":"expensive_tents" means if (price<100) then return "cheap_tents" else return "expensive_tents".

Do a preview on this step; you will see the rows with the replaced category values.
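As an aside for readers who want to sanity-check the recipe's logic outside of Kettle, here is a minimal Python sketch. It is not part of the original recipe: it assumes pandas is installed and uses a small made-up sample in place of outdoorProducts.txt, but it reproduces the same three-way split and the same ternary recategorization:

import pandas as pd

# Hypothetical sample matching the structure of outdoorProducts.txt.
df = pd.DataFrame({
    "id_product": [1, 2, 3, 4],
    "desc_product": ["Firesteel", "Freeze-Dried Food", "Small Tent", "Expedition Tent"],
    "price": [19, 53, 80, 150],
    "category": ["kitchen", "kitchen", "tents", "tents"],
})

# Equivalent of the two nested Filter rows steps: three resulting streams.
not_tents = df[df["category"] != "tents"]
cheap_tents = df[(df["category"] == "tents") & (df["price"] < 100)]
expensive_tents = df[(df["category"] == "tents") & (df["price"] >= 100)]

# Equivalent of the User Defined Java Expression ternary:
# (category == "tents") ? (price < 100 ? "cheap_tents" : "expensive_tents") : category
df["category"] = df.apply(
    lambda r: ("cheap_tents" if r["price"] < 100 else "expensive_tents")
    if r["category"] == "tents" else r["category"],
    axis=1,
)
print(df)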
How Data Science saved Christmas

Aaron Lazar
22 Dec 2017
9 min read
It's the middle of December and it's shivery cold in the North Pole at -20°C. A fat old man sits on a big brown chair, beside the fireplace, stroking his long white beard. His face has a frown on it, quite unlike his usual self. Mr. Claus quips, “Ruddy mailman should have been here by now! He's never this late to bring in the li'l ones' letters.”

[Image: Nervous Santa Claus on Christmas Eve, sitting on the armchair and resting his head on his hands]

Santa gets up from his chair, his trouser buttons crying for help, thanks to his massive belly. He waddles over to the window and looks out. He's sad that he might not be able to get the children their gifts in time this year. Amidst the snow, he can see a glowing red light. “Oh Rudolph!” he chuckles. All across the living room are pictures of little children beaming with joy, holding their presents in their hands. A small smile starts building and then suddenly, Santa gets a new-found determination to get the presents over to the children, come what may! An idea strikes him as he waddles over to his computer room.

Now Mr. Claus may be old on the outside, but on the inside, he's nowhere close! He recently set up a new rig, all by himself: six Nvidia GTX Titans, coupled with sixteen gigs of RAM, a 40-inch curved monitor that he uses to keep an eye on who's being naughty or nice, and a 1000 watt home theater system with surround sound, heavy on the bass. On the software side, he's got a whole load of tools, the likes of the Python language (not the Garden of Eden variety), OpenCV (his all-seeing eye on the kids) and, well, TensorFlow et al.

Now, you might wonder what an old man is doing with such heavy software and hardware. A few months ago, Santa caught wind that there's a new and upcoming trend that involves working with tonnes of data, cleaning, processing and making sense of it. The idea of crunching data somehow tickled the old man and since then, the jolly good master tinkerer and his army of merry elves have been experimenting away with data. Santa's pretty much self-taught at whatever he does, be it driving a sleigh or learning something new. A couple of interesting books he picked up from Packt were Python Data Science Essentials - Second Edition, Hands-On Data Science and Python Machine Learning, and Python Machine Learning - Second Edition. After spending some time on the internet, he put together a list of things he needed to set up his rig and got them from Amazon.

[Image: Santa Claus using a laptop on the top of a house]

He quickly boots up the computer and starts up TensorFlow. He needs to come up with a list of probable things that each child would have wanted for Christmas this year. Now, there are over 2 billion children in the world and finding each one's wish is going to be more than a task! But nothing is too difficult for Santa! He gets to work, his big head buried in his keyboard, his long locks falling over his shoulder.
So, this was his plan. Considering that the kids might have shared their secret wish with someone, Santa plans to tackle the problem from different angles, to reach a higher probability of getting the right gifts:

He plans to gather email and social media data from all the kids' computers, all from the past month
It's a good thing kids have started owning phones at such an early age now; he plans to analyze all incoming and outgoing phone calls that have happened over the course of the past month
He taps into every country's local police department's records to stream security footage from all over the world

[Image: A young boy wearing a red Christmas hat and red sweater writing a letter to Santa Claus, sitting at a wooden table in front of a Christmas tree]

If you've reached till here, you're probably wondering whether this article is about Mr. Claus or Mr. Bond. Yes, the equipment and strategy would have fit an MI6 or a CIA agent's role. You never know, Santa might just be a retired agent. Do they ever retire? Hmm!

Anyway, it takes a while before he can get all the data he needs. He trusts Spark to sort this data in order, which is stored in a massive data center in his basement (he's a bit cautious after all the news about data breaches). And he's off to work! He sifts through the emails and messages, snorting from time to time at some of the hilarious ones. TensorFlow rips through the data, picking out keywords for Santa. It takes him a few hours to get done with the emails and social media data alone! By the time he has a list, it's evening and time for supper. Santa calls it a day and prepares to continue the next day.

The next day, Santa gets up early and boots up his equipment as he brushes and flosses. He plonks himself in the huge swivel chair in front of the monitor, munching on freshly baked gingerbread. He starts tapping into all the phone company databases across the world, fetching all the data into his data center. Now, Santa can't afford to spend the whole time analyzing voices himself, so he lets TensorFlow analyze the voices and segregate the keywords it picks up from the voice signals, mapping every kid's name to a possible gift. Now, there were a lot of unmentionable things that got linked to several kids' names. Santa almost fell off his chair when he saw the list. “These kids grow up way too fast these days!” It's almost 7 PM in the evening when Santa realizes that there's way too much data to process in a day.

A few days later, Santa returns to his tech abode to check up on the progress of the call data processing. There's a huge list waiting in front of him. He thinks to himself, “This will need a lot of cleaning up!” He shakes his head thinking, I should have started with this! He now has to munge through all that camera footage! Santa had never worked on so much data before, so he started to get a bit worried that he might be unable to analyze it in time. He started pacing around the room trying to think up a workaround. Time was flying by and he still did not know how to speed up the video analyses.

Just when he's about to give up, the door opens and Beatrice walks in. Santa almost trips as he runs to hug his wife! Beatrice is startled for a bit but then breaks into a smile. “What is it dear? Did you miss me so much?” Santa replies, “You can't imagine how much! I've been doing everything on my own and I really need your help!” Beatrice smiles and says, “Well, what are we waiting for? Let's get down to it!”
Santa explains the problem to Beatrice in detail and tells her how far he's reached in the analysis. Beatrice thinks for a bit and asks Santa, “Did you try using Keras on top of TensorFlow?” Santa, blank for a minute, nods his head. Beatrice continues, “Well, from my experience, Keras gives TensorFlow a boost of about 10%, which should help quicken the analysis.” Santa just looks like he's made the best decision marrying Beatrice and hugs her again! “Bea, you're a genius!” he cries out. “Yeah, and don't forget to use Matplotlib!” she yells back as Santa hurries back to his abode.

He's off to work again, this time saddling up Keras to work on top of TensorFlow. Hundreds and thousands of terabytes of video data flow into the machines. He channels the output through OpenCV and ties it with TensorFlow to add a hint of deep learning. He quickly types out some Python scripts to integrate both the tools to create the optimal outcome. And then the wait begins. Santa keeps looking at his watch every half hour, hoping that the processing happens fast. The hardware has begun heating up quite a bit and he quickly races over to bring a cooler that's across the room. While he waits for the videos to finish up, he starts working on sifting out the data from the text and audio. He remembers what Beatrice said and uses Matplotlib to visualize it. Soon he has a beautiful map of the world with all the children's names and their possible gifts beside them.

Three days later, the video processing gets done. Keras truly worked wonders for TensorFlow! Santa now has another set of data to help him narrow down the gift list. A few hours later he's got his whole list visualized on Matplotlib.

[Image: Santa Claus riding on a sleigh with a gift box, against snow falling on a fir tree forest]

There's one last thing left to do! He suits up in red and races out the door to Rudolph and the other reindeer, unties them from the fence and leads them over to the sleigh. Once they're fastened, he loads up an empty bag onto the sleigh and it magically gets filled up. He quickly checks it to see if all is well and they're off!

It's Christmas morning and all the kids are racing out of bed to rip their presents open! There are smiles all around and everyone's got a gift, just as the saying goes! Even the ones who've been naughty have gotten gifts. Back in the North Pole, the old man is back in his abode, relaxing in an easy chair with his legs up on the table. The screen in front of him runs a real-time video feed of kids all over the world opening up their presents. A big smile on his face, Santa turns to look out the window at the glowing red light amongst the snow, and takes a swig of brandy from a hip flask. Thanks to data science, this Christmas is the merriest yet!
Getting started with Web Scraping with Beautiful Soup

Richa Tripathi
02 Jan 2018
5 min read
[box type="note" align="" class="" width=""]This article is an excerpt from a book written by Alberto Boschetti and Luca Massaron, titled Python Data Science Essentials - Second Edition. This book will take you beyond the fundamentals of data science by diving into the world of data visualizations, web development, and deep learning.[/box]

In this article, we will learn to download a web page and scrape data from it. This process happens more often than you might expect, and it's a very popular topic of interest in data science. For example:

Financial institutions scrape the Web to extract fresh details and information about the companies in their portfolio. Newspapers, social networks, blogs, forums, and corporate websites are the ideal targets for these analyses.
Advertisement and media companies analyze sentiment and the popularity of many pieces of the Web to understand people's reactions.
Companies specialized in insight analysis and recommendation scrape the Web to understand patterns and model user behaviors.
Comparison websites use the Web to compare prices, products, and services, offering the user an updated synoptic table of the current situation.

Unfortunately, understanding websites is very hard work, since each website is built and maintained by different people, with different infrastructures, locations, languages, and structures. The only common aspect among them is represented by the standard exposed language, which, most of the time, is HTML. That's why the vast majority of the web scrapers available today are only able to understand and navigate HTML pages in a general-purpose way.

One of the most used web parsers is named Beautiful Soup. It's written in Python, and it's very stable and simple to use. Moreover, it's able to detect errors and pieces of malformed code in the HTML page (always remember that web pages are often human-made products and prone to errors). A complete description of Beautiful Soup would require an entire book; here we will see just a few bits.

First of all, Beautiful Soup is not a crawler. In order to download a web page, we should use the urllib library, for example. Let's now download the code behind the William Shakespeare page on Wikipedia:

In: import urllib.request
    url = 'https://en.wikipedia.org/wiki/William_Shakespeare'
    request = urllib.request.Request(url)
    response = urllib.request.urlopen(request)

It's time to instruct Beautiful Soup to read the resource and parse it using the HTML parser:

In: from bs4 import BeautifulSoup
    soup = BeautifulSoup(response, 'html.parser')

Now the soup is ready and can be queried. To extract the title, we can simply ask for the title attribute:

In: soup.title

Out: <title>William Shakespeare - Wikipedia, the free encyclopedia</title>

As you can see, the whole title tag is returned, allowing a deeper investigation of the nested HTML structure. What if we want to know the categories associated with the Wikipedia page of William Shakespeare? It can be very useful to create a graph of the entry, simply by recursively downloading and parsing adjacent pages. We should first manually analyze the HTML page itself to figure out the best HTML tag containing the information we're looking for. Remember here the "no free lunch" theorem in data science: there are no auto-discovery functions and, furthermore, things can change if Wikipedia modifies its format.
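One simple way to do that manual inspection (not shown in the original excerpt) is to dump a prettified snippet of the parsed page and search it for text you expect to sit near the target data. The sketch below reuses the soup object built above; the 2000-character and 500-character windows are arbitrary illustrative values:

# Eyeball the parsed HTML before writing the extraction code.
pretty = soup.prettify()
print(pretty[:2000])                 # inspect the first part of the document

# Search the prettified text for a keyword we expect near the category links.
position = pretty.find('Category:')
if position != -1:
    print(pretty[position:position + 500])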
After a manual analysis, we discover that the categories are inside a div named "mw-normal-catlinks"; excluding the first link, all the others are okay. Now it's time to program. Let's put into code what we've observed, printing, for each category, the title of the linked page and the relative link to it:

In: section = soup.find_all(id='mw-normal-catlinks')[0]
    for catlink in section.find_all("a")[1:]:
        print(catlink.get("title"), "->", catlink.get("href"))

Out: Category:William Shakespeare -> /wiki/Category:William_Shakespeare
     Category:1564 births -> /wiki/Category:1564_births
     Category:1616 deaths -> /wiki/Category:1616_deaths
     Category:16th-century English male actors -> /wiki/Category:16th-century_English_male_actors
     Category:English male stage actors -> /wiki/Category:English_male_stage_actors
     Category:16th-century English writers -> /wiki/Category:16th-century_English_writers

We've used the find_all method twice to find all the HTML tags matching the argument. In the first case, we were specifically looking for an ID; in the second case, we were looking for all the "a" tags. Given the output, and using the same code with the new URLs, it's possible to recursively download the Wikipedia category pages, arriving at this point at the ancestor categories.

A final note about scraping: always remember that this practice is not always allowed, and when it is, remember to tune down the rate of your downloads (at high rates, the website's server may think you're doing a small-scale DoS attack and will probably blacklist/ban your IP address). For more information, you can read the terms and conditions of the website, or simply contact the administrators. Downloading data from sites where copyright laws are in place is bound to get you into real legal trouble. That's also why most companies that employ web scraping use external vendors for this task, or have a special arrangement with the site owners.

We learned the process of data scraping, its practical use cases in industry and, finally, we put theory into practice by learning how to scrape data from the web with the help of Beautiful Soup.

If you enjoyed this excerpt, check out the book Python Data Science Essentials - Second Edition to know more about different libraries, such as pandas and NumPy, that can provide you with all the tools to load and effectively manage your data.
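As a small addendum (not part of the original excerpt), here is a minimal sketch of the recursive category download mentioned above, with the request rate deliberately throttled as the note recommends. The one-second pause and the five-page limit are arbitrary illustrative values:

# Follow category links breadth-first while keeping the request rate low.
import time
import urllib.request
from bs4 import BeautifulSoup

base = 'https://en.wikipedia.org'
start = base + '/wiki/William_Shakespeare'
to_visit = [start]
visited = set()

while to_visit and len(visited) < 5:          # arbitrary small page limit
    url = to_visit.pop(0)
    if url in visited:
        continue
    response = urllib.request.urlopen(url)
    soup = BeautifulSoup(response, 'html.parser')
    visited.add(url)
    catlinks = soup.find_all(id='mw-normal-catlinks')
    if catlinks:
        for a in catlinks[0].find_all('a')[1:]:
            to_visit.append(base + a.get('href'))
    time.sleep(1.0)                            # be polite: throttle requests

print(len(visited), 'pages fetched')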