
How-To Tutorials - Data


Creating a pivot table

Packt
29 Aug 2013
8 min read
(For more resources related to this topic, see here.)

A pivot table is the core business intelligence tool that helps turn raw data from various sources into meaningful results. By using different ways of presenting data, we are able to identify relations between seemingly separate data and reach conclusions that help us identify our strengths and areas of improvement.

Getting ready

Prepare the two files titled DatabaseData_v2.xlsx and GDPData_v2.xlsx. We will be using these along with other data sources to create a meaningful PowerPivot table that will be used for intelligent business analysis.

How to do it...

For each of the two files, we will build upon the file and add a pivot table to it, gaining exposure using data we are already familiar with. The following are the steps to create a pivot table with the DatabaseData_v2.xlsx file, which results in the creation of a DatabaseData_v3.xlsx file:

1. Open the PowerPivot window of the DatabaseData_v2.xlsx file with its 13 tables.
2. Click on the PivotTable button near the middle of the top row and save as New Worksheet.
3. Select the checkboxes as shown in the following screenshot:
4. Select CountryRegion | Name and move it under Row Labels.
5. Select Address | City and move it under Row Labels.
6. Select Address | AddressLine1 as Count of AddressLine1 and move it under Values.

Now, this shows the number of clients per city and per country. However, it is very difficult to navigate, as each country name has to be collapsed in order to see the next country. Let us move the CountryRegion | Name column to Slicers Vertical. Now, the PowerPivot Field List dashboard should appear as shown in the following screenshot:

The pivot table should now display simple results: the number of clients in a region, filterable by country using slicers. Let us apply some formatting to allow for a better understanding of the data. Right-click on Name under the Slicers Vertical area of the PowerPivot Field List dashboard. Select Field Settings, then change the name to Country Name. We now see that the title of the slicer has changed from Name to Country Name, allowing anyone who views this data to better understand what it represents. Similarly, right-click on Count of AddressLine1 under Values, select Edit Measure, and then change its name to Number of Clients. Also change the data title City under the Row Labels area to City Name. The result should appear as shown in the following screenshot:

Let's see our results change as we click on different country names. We can filter for multiple countries by holding the Ctrl key while clicking, and we can remove all filters by clicking the small button on the top-right of the slicers. This is definitely easier to navigate and to understand compared to what we did at first without using slicers, which is how it would appear in Excel 2010 without PowerPivot.

However, this table is still too big. Clicking on Canada gives too many cities whose names many of us have not heard of before. Let us break the data down further by including states/provinces. Select StateProvince | Name, move it under Slicers Horizontal, and change its title to State Name. It is a good thing that we are renaming these as we go along; otherwise, there would be two datasets called Name, and anyone following along would get confused. Now, we should see the state name filter at the top, the country name filter on the left, and a list of cities with the number of clients in the middle. This arrangement, however, is somewhat awkward.
Let us rearrange the filters by having the largest filter (country) at the top and the sub-filter (state) on the left-hand side. This can be done simply by dragging the Country Name dataset to Slicers Horizontal and State Name to Slicers Vertical. After moving the slicers around a bit, the result should appear as shown in the following screenshot:

Again, play with the results and try to understand the features: try filtering by a country and by a state/province. Now there is a limited number of cities shown for each country and each state/province, making it easier to see the list of cities. However, for countries such as the United States, there are just too many states. Let us change the formatting of the vertical filter to display three states per line, so it is easier to find the state we are looking for. This can be done by right-clicking on the vertical filter, selecting Size and Properties | Position and Layout, and then changing the Number of Columns value. Repeat the same step for Country Name to display six columns, and then resize the filters so they are easier to select. Change the name of the sheet to PivotTable and then save the file as DatabaseData_v3.xlsx.

The following are the steps to create a pivot table with the GDPData_v2.xlsx file, which results in the creation of a GDPData_v3.xlsx file:

1. Open the PowerPivot window of the GDPData_v2.xlsx file with its two tables.
2. Click on the PivotTable button near the middle of the top row and save as New Worksheet.
3. Move the year datasets from 2000 through 2010 into the Values field, move Country Name into the Row Labels field, and move Country Name again into the Slicers Horizontal field.
4. In the slicer, select five countries: Canada, China, Korea, Japan, and United States, as shown in the following screenshot:
5. Select all the value fields and reduce the number of decimal places.

We can now clearly see that GDP in China tripled over the decade, and that only China and Korea saw an increase in GDP from 2008 to 2009, while the GDP of the other nations dropped due to the 2008 financial crisis. Knowing the relevant background of world financial events, we can make intelligent analyses, such as which markets to invest in if we are worried about another financial crisis taking place.

As the data gets larger, reading the raw GDP numbers becomes increasingly difficult. In such cases, we can switch the type of data displayed by using the Show Values As button in the PivotTable Tools | Options menu. Play around with it and see how it works: % of Column Total shows each GDP as a percentage of the year, % Difference From allows the user to set one value as the standard and compare the rest to it, and the Rank Largest to Smallest option simply shows the ranking based on which country earns the most GDP. Change the name of the sheet to PivotTable and then save the file as GDPData_v3.xlsx.

How it works...

We looked at two different files and focused on two different fields. The first file was more qualitative and showed the relationship between regions and the number of clients, using various features of pivot tables such as slicers. We also looked at how to format various aspects of a pivot table for easier processing and a better understanding of the represented data. Slicers embedded in the pivot table are a unique and very powerful feature of PowerPivot that allow us to sort through data simply by clicking on different criteria.
Adding more slicers helps to filter the data further, enabling the user to slice the data in all sorts of ways. There is no difference between horizontal and vertical slicers other than their location.

From the second file, we focused more on the quantitative data and on different ways of representing it. By using slicers to limit the number of countries, we were able to focus on the data presented, represent the GDP in various formats such as percentages and ranks, and compare the differences between the numbers by selecting one value as a standard. A similar approach could be applied to the first file to show, for example, the percentage of clients per nation.

There's more...

We covered the very basic setup of creating a pivot table. We can also create relationships between data and create custom fields so that better results are produced. So don't worry about why the pivot table looks so small for now!

For those who do not like working with a pivot table, there is also a feature that will convert all cells into Excel formulas. Under the PowerPivot Tools | OLAP Tools option, the Convert to Formulas button does exactly that. However, be warned that it cannot be undone, as the changes are permanent.

Summary

In this article, we learned how to use raw data to make pivot tables that can help us make smart business decisions!

Resources for Article:

Further resources on this subject:
- SAP HANA integration with Microsoft Excel [Article]
- Managing Core Microsoft SQL Server 2008 R2 Technologies [Article]
- Eloquent relationships [Article]


Schemas and Models

Packt
27 Aug 2013
12 min read
(For more resources related to this topic, see here.)

So what is a schema? At its simplest, a schema is a way to describe the structure of data. Typically, this involves giving each piece of data a label and stating what type of data it is, for example, a number, date, string, and so on.

In the following example, we are creating a new Mongoose schema called userSchema. We are stating that a database document using this schema will have three pieces of data, which are as follows:

- name: This data will contain a string
- email: This will also contain a string value
- createdOn: This data will contain a date

The following is the schema definition:

var userSchema = new mongoose.Schema({
  name: String,
  email: String,
  createdOn: Date
});

Field sizes

Note that, unlike some other systems, there is no need to set the field size. This can be useful if you need to change the amount of data stored in a particular object. For example, your system might impose a 16-character limit on usernames, so you set the size of the field to 16 characters. Later, you realize that you want to encrypt the usernames, but this will double the length of the data stored. If your database schema uses fixed field sizes, you will need to refactor it, which can take a long time on a large database. With Mongoose, you can just start encrypting that data object without worrying about it. If you're storing large documents, you should bear in mind that MongoDB imposes a maximum document size of 16 MB. There are ways around even this limit, using the MongoDB GridFS API.

Data types allowed in schemas

There are eight types of data that can, by default, be set in a Mongoose schema. These are also referred to as SchemaTypes; they are:

- String
- Number
- Date
- Boolean
- Buffer
- ObjectId
- Mixed
- Array

The first four SchemaTypes are self-explanatory, but let's take a quick look at them all.

String

This SchemaType stores a string value, UTF-8 encoded.

Number

This SchemaType stores a number value, with restrictions. Mongoose does not natively support long and double datatypes, for example, although MongoDB does. However, Mongoose can be extended using plugins to support these other types.

Date

This SchemaType holds a date and time object, typically returned from MongoDB as an ISODate object, for example, ISODate("2013-04-03T12:56:26.009Z").

Boolean

This SchemaType has only two values: true or false.

Buffer

This SchemaType is primarily used for storing binary information, for example, images stored in MongoDB.

ObjectId

This SchemaType is used to assign a unique identifier to a key other than _id. Rather than just specifying the type as ObjectId, you need to specify the fully qualified version, Schema.Types.ObjectId. For example:

projectSchema.add({owner: mongoose.Schema.Types.ObjectId});

Mixed

A mixed data object can contain any type of data. It can be declared either by setting an empty object or by using the fully qualified Schema.Types.Mixed. The following two commands will do the same thing:

var djSchema = new mongoose.Schema({mixedUp: {}});
var djSchema = new mongoose.Schema({mixedUp: Schema.Types.Mixed});

While this sounds like it might be great, there is a big caveat. Changes to data of the Mixed type cannot be automatically detected by Mongoose, so it doesn't know that it needs to save them.

Tracking changes to Mixed type

As Mongoose can't automatically see changes made to Mixed type data, you have to manually declare when the data has changed.
Fortunately, Mongoose exposes a method called markModified to do just this, passing it the path of the data object that has changed.

dj.mixedUp = { valueone: "a new value" };
dj.markModified('mixedUp');
dj.save();

Array

The array datatype can be used in two ways. First, a simple array of values of the same data type, as shown in the following code snippet:

var userSchema = new mongoose.Schema({
  name: String,
  emailAddresses: [String]
});

Second, the array datatype can be used to store a collection of subdocuments using nested schemas. Here's an example of how this can work:

var emailSchema = new mongoose.Schema({
  email: String,
  verified: Boolean
});
var userSchema = new mongoose.Schema({
  name: String,
  emailAddresses: [emailSchema]
});

Warning – array defined as mixed type

A word of caution. If you declare an empty array, it will be treated as mixed type, meaning that Mongoose will not be able to automatically detect any changes made to the data. So avoid these two types of array declaration, unless you intentionally want a mixed type:

var emailSchema = new mongoose.Schema({addresses: []});
var emailSchema = new mongoose.Schema({addresses: Array});

Custom SchemaTypes

If your data requires a different datatype which is not covered earlier in this article, Mongoose offers the option of extending it with custom SchemaTypes. The extension method is managed using Mongoose plugins. Some examples of SchemaType extensions that have already been created are long, double, RegEx, and even email.

Where to write the schemas

As your schemas sit on top of Mongoose, the only absolute is that they need to be defined after Mongoose is required. You don't need an active or open connection to define your schemas. That being said, it is advisable to make your connection early on, so that it is available as soon as possible, bearing in mind that remote databases or replica sets may take longer to connect than your localhost development server. While no action can be taken on the database through the schemas and models until the connection is open, Mongoose can buffer requests made from when the connection is defined. Mongoose models also rely on the connection being defined, so there's another reason to get the connection set up early in the code and then define the schemas and models.

Writing a schema

Let's write the schema for a User in our MongoosePM application. The first thing we have to do is declare a variable to hold the schema. I recommend taking the object name (for example, user or project) and adding Schema to the end of it. This makes following the code later on super easy. The second thing we need to do is create a new Mongoose schema object to assign to this variable. The skeleton of this is as follows:

var userSchema = new mongoose.Schema({ });

We can add in the basic values of name, email, and createdOn that we looked at earlier, giving us our first user schema definition:

var userSchema = new mongoose.Schema({
  name: String,
  email: String,
  createdOn: Date
});

Modifying an existing schema

Suppose we run the application with this for a while, and then decide that we want to record the last time each user logged on, and the last time their record was modified. No problem! We don't have to refactor the database or take it offline while we upgrade the schema; we simply add a couple of entries to the Mongoose schema. If a key requested in the schema doesn't exist, neither Mongoose nor MongoDB will throw errors; Mongoose will just return null values.
When saving the MongoDB documents, the new keys and values will be added and stored as required. If the value is null, then the key is not added. So let's add modifiedOn and lastLogin to our userSchema:

var userSchema = new mongoose.Schema({
  name: String,
  email: String,
  createdOn: Date,
  modifiedOn: Date,
  lastLogin: Date
});

Setting a default value

Mongoose allows us to set a default value for a data key when the document is first created. Looking at our schema created earlier, a possible candidate for this is createdOn. When a user first signs up, we want the date and time to be set. We could do this by adding a timestamp to the controller function when we create a user, or to make a point, we can modify the schema to set a default value. To do this, we need to change the information we are sending about the createdOn data object. What we have currently is:

createdOn: Date

This is short for:

createdOn: { type: Date }

We can add another entry to this object to set a default value here, using the JavaScript Date object:

createdOn: { type: Date, default: Date.now }

Now every time a new user is created, its createdOn value will be set to the current date and time. Note that in JavaScript, default is a reserved word. While the language allows reserved words to be used as keys, some IDEs and linters regard it as an error. If this causes issues for you or your environment, you can wrap it in quotes, like in the following code snippet:

createdOn: { type: Date, 'default': Date.now }

Only allowing unique entries

If we want to ensure that there is only ever one user per e-mail address, we can specify that the email field should be unique.

email: {type: String, unique: true}

With this in place, when saving to the database, MongoDB will check to see if the e-mail value already exists in another document. If it finds it, MongoDB (not Mongoose) will return an E11000 error. Note that this approach also defines a MongoDB index on the email field.

Our final User schema

Your userSchema should now look like the following:

var userSchema = new mongoose.Schema({
  name: String,
  email: {type: String, unique: true},
  createdOn: { type: Date, default: Date.now },
  modifiedOn: Date,
  lastLogin: Date
});

A corresponding document from the database would look like the following (line breaks are added for readability):

{
  "__v" : 0,
  "_id" : ObjectId("5126b7a1f8a44d1e32000001"),
  "createdOn" : ISODate("2013-02-22T00:11:13.436Z"),
  "email" : "simon@theholmesoffice.com",
  "lastLogin" : ISODate("2013-04-03T12:54:42.734Z"),
  "modifiedOn" : ISODate("2013-04-03T12:56:26.009Z"),
  "name" : "Simon Holmes"
}

What's that "__v" thing?

You may have noticed a data entity in the document that we didn't set: __v. This is an internal versioning number automatically set by Mongoose when a document is created. It doesn't increment when a document is changed, but instead is automatically incremented whenever an array within the document is updated in such a way that might cause the indexed position of some of the entries to have changed.

Why is this needed? When working with an array, you will typically access the individual elements through their positional index, for example, myArray[3]. But what happens if somebody else deletes the element in myArray[2] while you are editing the data in myArray[3]? Your original data is now contained in myArray[2], but you don't know this, so you quite happily overwrite whatever data is now stored in myArray[3]. The __v value gives you a method to sanity check this and prevent this scenario from happening.
Defining the Project schema

As part of our MongoosePM application, we also need to think about projects. After all, the PM here does stand for Project Manager. Let's take what we've learned and create the Project schema. We are going to want a few types of data to start with:

- projectName: A string containing the name of the project.
- createdOn: The date when the document was first created and saved. This option is set to automatically save the current date and time.
- modifiedOn: The date and time when the document was last changed.
- createdBy: A string that will, for now, contain the unique ID of the user who created the project.
- tasks: A string to hold task information.

Transforming these requirements into a Mongoose schema definition, we create the following:

var projectSchema = new mongoose.Schema({
  projectName: String,
  createdOn: Date,
  modifiedOn: { type: Date, default: Date.now },
  createdBy: String,
  tasks: String
});

This is our starting point, and we will build upon it. For now, we have these basic data objects as mentioned previously in this article. Here's an example of a corresponding document from the database (line breaks added for readability):

{
  "projectName" : "Another test",
  "createdBy" : "5126b7a1f8a44d1e32000001",
  "createdOn" : ISODate("2013-04-03T17:47:51.031Z"),
  "tasks" : "Just a simple task",
  "_id" : ObjectId("515c6b47596acf8e35000001"),
  "modifiedOn" : ISODate("2013-04-03T17:47:51.032Z"),
  "__v" : 0
}

Improving the Project schema

Throughout the rest of the article we will be improving this schema, but the beauty of using Mongoose is that we can do this relatively easily. Putting together a basic schema like this to build upon is a great approach for prototyping: you have the data you need there, and you can add complexity where you need it, when you need it.

Building models

A single instance of a model maps directly to a single document in the database. With this 1:1 relationship, it is the model that handles all document interaction: creating, reading, saving, and deleting. This makes the model a very powerful tool. Building the model is pretty straightforward. When using the default Mongoose connection, we can call the mongoose.model command, passing it two arguments:

- The name of the model
- The name of the schema to compile

So if we were to build a model from our user schema, we would use this line:

mongoose.model( 'User', userSchema );

If you're using a named Mongoose connection, the approach is very similar:

adminConnection.model( 'User', userSchema );

Instances

It is useful to have a good understanding of how a model works. After building the User model using the previous line, we could create two instances:

var userOne = new User({ name: 'Simon' });
var userTwo = new User({ name: 'Sally' });

Summary

In this article, we have looked at how schemas and models relate to your data. You should now understand the roles of both schemas and models. We have looked at how to create simple schemas and the types of data they can contain. We have also seen that it is possible to extend this if the native types are not enough. In the MongoosePM project, you should now have added a User schema and a Project schema, and built models of both of these.

Resources for Article:

Further resources on this subject:
- Understanding Express Routes [Article]
- Validating and Using the Model Data [Article]
- Creating Your First Web Page Using ExpressionEngine: Part 1 [Article]


Creating Multivariate Charts

Packt
26 Aug 2013
10 min read
(For more resources related to this topic, see here.)

With an increasing number of variables, any analysis can become challenging and any observations harder; however, Tableau simplifies the process for the designer and uses effective layouts for the reader, even in multivariate analysis. Using various combinations of colors and charts, we can create compelling graphics that generate critical insights from our data. Among the charts covered in this article, facets and area charts are easier to understand and easier to create compared to bullet graphs and dual axes charts.

Creating facets

Facets are one of the powerful features in Tableau. Edward Tufte, a pioneer in the field of information graphics, championed these types of charts, also called grid or panel charts; he called them small multiples. These charts show the same measure(s) across various values of one or two variables for easier comparison.

Getting ready

Let's use the sample file Sample – Coffee Chain (Access). Open a new worksheet and select Sample – Coffee Chain (Access) as the data source.

How to do it...

Once the data file is loaded on the new worksheet, perform the following steps to create a simple faceted chart:

1. Drag-and-drop Market from Dimensions into the Columns shelf.
2. Drag-and-drop Product Type from Dimensions into the Rows shelf.
3. Drag-and-drop Profit from Measures into the Rows shelf next to Product Type.
4. Optionally, you can drag-and-drop Market into the Color Marks box to give color to the four bars of the different Market areas.

The chart should look like the one in the following screenshot:

How it works...

When there is one dimension on one of the shelves, either Columns or Rows, and one measure on the other shelf, Tableau creates a univariate bar chart, but when we drop additional dimensions along with the measure, Tableau creates small charts or facets and displays univariate charts broken down by a dimension.

There's more...

A company named Juice Analytics has a great blog article on the topic of small multiples. This article lists the benefits of using small multiples as well as some examples of small multiples in practice. Find this blog at http://www.juiceanalytics.com/writing/better-know-visualization-small-multiples/.

Creating area charts

An area chart is an extension of a line chart. The area chart shows the line of the measure but fills the area below the line to emphasize the value of the measure. A special case of the area chart is the stacked area chart, which shows a line per measure and fills the area between the lines. Tableau's implementation of area charts uses one date variable and one or more measures.

Getting ready

Let's use the sample file Sample – Superstore Sales (Excel). Open a new worksheet and select Sample – Superstore Sales (Excel) as the data source.

How to do it...

Once the data is loaded on the new worksheet, perform the following steps to create an area chart:

1. Click on the Show Me button to bring the Show Me toolbar on the screen.
2. Select Order Date from Dimensions and Order Quantity from Measures by clicking on them while holding the Ctrl key.
3. Click on Area charts (continuous) from the Show Me toolbar.
4. Drag-and-drop Order Date into the Columns shelf next to YEAR(Order Date).
5. Expand YEAR(Order Date), seen on the right-hand side, by clicking on the plus sign.
6. Drag-and-drop Region from Dimensions into the Rows shelf to the left of SUM(Order Quantity).
The chart should look like the one in the following screenshot:

How it works...

When we added Order Date for the first time, Tableau, by default, aggregated the date field by year; therefore, we added Order Date again to create an aggregation by quarter of the Order Date. We also added Region to create facets on the regions, which show the trends of order quantity over time.

There's more...

A blog post by Visually, an information graphics company, discusses the key differences between line charts and area charts. You can find this post at http://blog.visual.ly/line-vs-area-charts/.

Creating bullet graphs

Stephen Few, an information visualization consultant and author, designed this chart to solve some of the problems that the gauge and meter types of charts pose. Gauges, although simple to understand, take a lot of space to show only one measure. Bullet graphs are a combination of the bar graph and thermometer types of charts; they show a measure of interest in the form of a bar graph (which is the bullet) along with target variables.

Getting ready

Let's use the sample file Sample – Coffee Chain (Access). Open a new worksheet and select Sample – Coffee Chain (Access) as the data source.

How to do it...

Once the data is loaded on the sheet, perform the following steps to create a bullet graph:

1. Click on the Show Me button to bring the Show Me toolbar on the screen.
2. While holding the Ctrl key, click on Type and Market from Dimensions and Budget Sales and Sales from Measures.
3. Click on the bullet graphs icon on the Show Me toolbar.
4. Right-click on the x axis (the Budget Sales axis) and click on Swap Reference Line Fields.

The final chart should look like the one in the following screenshot:

How it works...

Although bullet graphs maximize the available space to show relevant information, readers require a detailed explanation of what all the components of the graphic encode. In this recipe, since we want to compare the budgeted sales with the actual sales, we had to swap the reference line from Sales to Budget Sales. The black bar on the graphic shows the budgeted sales and the blue bar shows the actual sales. The dark gray background color shows 60 percent of the actual sales and the lighter gray shows 80 percent of the actual sales. As we can see in this chart, the blue bars cross all the black lines, which tells us that both the coffee types and all market regions exceeded the budgeted sales.

There's more...

A blog post by Data Pig Technologies discusses some of the problems with the bullet graph. The main problem is the intuitive understanding of this chart. You can read about this problem and the reply by Stephen Few at http://datapigtechnologies.com/blog/index.php/the-good-and-bad-of-bullet-graphs/.

Creating dual axes charts

Dual axes charts are useful for comparing two similar types of measures that may have different measurement units, such as pounds and dollars. In this recipe, we will look at the dual axes chart.

Getting ready

Let's use the same sample file, Sample – Coffee Chain (Access). Open a new worksheet and select Sample – Coffee Chain (Access) as the data source.

How to do it...

Once the data is loaded on the sheet, perform the following steps to create a dual axes chart:

1. Click on the Show Me button to bring the Show Me toolbar on the screen.
2. While holding the Ctrl key, click on Date, Type, and Market from Dimensions and Sales and Budget Sales from Measures.
3. Click on the dual line graph icon on the Show Me toolbar.
4. Click-and-drag Market from the Rows shelf into the Columns shelf.
5. Right-click on the Sales vertical axis and click on Synchronize Axis.

The chart should look like the one shown in the following screenshot:

How it works...

Tableau will create two vertical axes and automatically place Sales on one vertical axis and Budget Sales on the other. The scales on the two vertical axes are different, however. By synchronizing the axes, we get the same scale on both axes for better comparison and a more accurate representation of the patterns.

Creating Gantt charts

Gantt charts are most commonly used in project management, as these charts show various activities and tasks along with the time required to complete those tasks. Gantt charts are even more useful when they show dependencies among various tasks. This type of chart is very helpful when the number of activities is low (around 20-30); otherwise, the chart becomes too big to be understood easily.

Getting ready

Let's use the sample file Sample – Superstore Sales (Excel). Open a new worksheet and select Sample – Superstore Sales (Excel) as the data source.

How to do it...

Once the data is loaded, perform the following steps to create a Gantt chart:

1. Click on Analysis from the top menu toolbar, and if Aggregate Measures is checked, click on it again to uncheck that option.
2. Click on the Show Me button to bring the Show Me toolbar on the screen.
3. While holding the Ctrl key, click on Order Date and Category (under Products) from Dimensions and Time to Ship from Measures.
4. Click on the Gantt chart icon on the Show Me toolbar.
5. Drag-and-drop Order Date into the Filters pane. Select Years from the Filter Field [Order Date] options dialog box and hit Next. Check 2012 from the list and hit OK.
6. Right-click on YEAR(Order Date) on the Columns shelf and select the Day May 8, 2011 option.
7. Drag-and-drop Order Date into the Filters pane. Select Months from the Filter Field [Order Date] options dialog box and hit Next. Check December from the list and hit OK.
8. Drag-and-drop Region from Dimensions into the Color Marks input box.
9. Drag-and-drop Region from Dimensions into the Rows shelf before Category.

The generated Gantt chart should look like the one in the following screenshot:

How it works...

Representing time this way helps the reader discern which activity took the longest amount of time. We added the Order Date field twice in the Filters pane to filter first for the year 2012 and then for the month of December. In this recipe, out of all the products shipped in December of 2012, we can easily see that the red bars for the West region in the Office Supplies category are longer, suggesting that these products took the longest amount of time to ship.

There's more...

Andy Kriebel, a Tableau data visualization expert, has a great example of Gantt charts using US presidential data. The following link shows the lengths of terms in office of Presidents from various parties: http://vizwiz.blogspot.com/2010/09/tableau-tip-creating-waterfall-chart.html

Creating heat maps

A heat map is a visual representation of numbers in a table or a grid, such that the bigger numbers are encoded by darker colors or bigger sizes, and the smaller numbers by lighter colors or smaller sizes. This type of representation makes it easier for the reader to detect patterns in the data.

Getting ready

Let's use the same sample file, Sample – Superstore Sales (Excel). Open a new worksheet and select Sample – Superstore Sales (Excel) as the data source.

How to do it...
Once the data is loaded, perform the following steps to create a heat map chart:

1. Click on the Show Me button to bring the Show Me toolbar on the screen.
2. While holding the Ctrl key, click on Sub-Category (under Products), Region, and Ship Mode from Dimensions and Profit from Measures.
3. Click on the heat maps chart icon on the Show Me toolbar.
4. Drag-and-drop Profit from Measures into the Color Marks box.

The generated chart should look like the one in the following screenshot:

How it works...

When we created the chart for the first time, Tableau assigned various sizes to the square boxes, but when we placed Profit as a color mark, red was used for low amounts of profit and green was used for higher amounts of profit. This made spotting patterns very easy. Binders and Binder Accessories, shipped by Regular Air in the Central region, generated a very high amount of profit, while Tables, shipped by Delivery Trucks in the East region, generated a very low amount of profit (it actually created losses for the company).

Resources for Article:

Further resources on this subject:
- Constructing and Evaluating Your Design Solution [Article]
- Basic use of Local Storage [Article]
- Creating Interactive Graphics and Animation [Article]


Making Big Data Work for Hadoop and Solr

Packt
23 Aug 2013
8 min read
(For more resources related to this topic, see here.)

Understanding data processing workflows

Based on the data, the configuration, and the requirements, data can be processed at multiple levels while it is getting ready for search. Cascading and LucidWorks Big Data are a few such application platforms with which a complex data processing workflow can be rapidly developed on the Hadoop framework. In Cascading, the data is processed in different phases, with each phase containing a pipe responsible for carrying data units and applying a filter. The following diagram shows how incoming data can be processed in the pipeline-based workflow:

Once the data is passed through the workflow, it can be persisted at the end with a repository and later synced with the various nodes running in a distributed environment. The pipelining technique offers the following advantages:

- The Apache Solr engine has minimal work to do during index creation
- Incremental indexing can be supported
- By introducing an intermediate store, you can have regular data backups at the required stages
- The data can be transferred to a different type of storage, such as HDFS, directly through multiple processing units
- The data can be merged, joined, and processed as per the needs of the different data sources

LucidWorks Big Data is a more powerful product which helps the user generate bulk indexes on Hadoop, classify and analyze the data, and provide distributed searching capabilities. Sharding is the process of breaking one index into multiple logical units called shards across multiple records. In the case of Solr, the results will be aggregated and returned.

Big Data based technologies can be used with Apache Solr for various operations. Index creation itself can be made to run on a distributed system in order to speed up the overall index generation activity. Once that is done, the index can be distributed across the different nodes participating in Big Data, and Solr can be made to run in a distributed manner for searching the data. You can set up your Solr instance in the following different configurations:

Standalone machine

This configuration uses a single high-end server containing the indexes and Solr search; it is suitable for development and, in some cases, production.

Distributed setup

A distributed setup is suitable for large-scale indexes where the index is difficult to store on one system. In this case, the index has to be distributed across multiple machines. Although the distributed configuration of Solr offers ample flexibility in terms of processing, it has its own limitations. A lot of features of Apache Solr, such as MoreLikeThis and Joins, are not supported. The following diagram depicts the distributed setup:

Replicated mode

In this mode, more than one Solr instance exists; among them, the master instance provides shared access to its slaves for replicating the indexes across multiple systems. The master continues to participate in index creation, search, and so on. Slaves sync up the storage through various replication techniques, such as the rsync utility. By default, Solr includes Java-based replication that uses the HTTP protocol for communication. This replication is recommended due to its benefits over other external replication techniques. This mode is not used anymore with the release of the Solr 4.x versions.

Sharded mode

This mode combines the best of both worlds and brings in the real value of a distributed system with high availability.
In this configuration, the system has multiple masters, and each master holds multiple slaves to which the replication has gone through. A load balancer is used to handle the load on the multiple nodes equally. The following diagram depicts the distributed and replicated setup:

If Apache Solr is deployed on a Hadoop-like framework, it falls into this category. Solr also provides SolrCloud for distributed Solr. We are going to look at different approaches in the next section.

Using Solr 1045 patch – map-side indexing

The work on the Solr-1045 patch started with the goal of achieving index generation/building using an Apache MapReduce task. The Solr-1045 patch converts all the input records to a set of <key, value> pairs in each map task that runs on Hadoop. Further, it goes on to create a SolrInputDocument from each <key, value> pair, and later creates the Solr indexes. The following diagram depicts this process:

Reduce tasks can be used to perform deduplication of indexes and merge them together if required. Although merging indexes seems to be an interesting feature, it is actually a costly affair in terms of processing, and you will not find many implementations with the merge index functionality. Once the indexes are created, you can load them on your Solr instance and use them for searching.

You can download this particular patch from https://issues.apache.org/jira/browse/SOLR-1045 and patch your Solr instance. To apply a patch to your Solr instance, you need to first build your Solr instance using the source. You can download the patch from Apache JIRA. Before running the patch, first do a dry run, which does not actually apply the patch. You can do it with the following commands:

cd <solr-trunk-dir>
svn patch <name-of-patch> --dry-run

If it is successful, you can run the patch without the --dry-run option to apply the patch.

Let's look at some of the important classes in the patch:

- SolrIndexUpdateMapper: This class is a Hadoop mapper responsible for creating indexes out of the <key, value> pairs of input.
- SolrXMLDocRecordReader: This class is responsible for reading Solr input XML files.
- SolrIndexUpdater: This class creates a MapReduce job configuration, runs the job to read the documents, and updates the Solr instance. Right now, it is built using the Lucene index updater.

Benefits and drawbacks

The following are the benefits and drawbacks of using the Solr-1045 patch:

Benefits

- It achieves complete parallelism by creating the indexes right in the map task.
- Merging of indexes is possible in the reduce phase of MapReduce.

Drawbacks

- When the indexing is done at the map side, all the <key, value> pairs received by the reducer gain equal weight/importance. So, it is difficult to use this patch with data that carries ranking/weight information.

Using Solr 1301 patch – reduce-side indexing

This patch focuses on using the Apache MapReduce framework for index creation. Keyword search can happen over Apache Solr or Apache SolrCloud. Unlike Solr-1045, in this patch the indexes are created in the reduce phase of MapReduce. Here, a map task is responsible for converting the input records to <key, value> pairs; later, they are passed to the reducer, which in turn converts them into SolrInputDocument objects and then creates the indexes out of them. This index is then passed as the output of the Hadoop MapReduce process. The following diagram depicts this process:

To use the Solr-1301 patch, you need to set up a Hadoop cluster. Once the index is created through the Hadoop patch, it should then be provisioned to the Solr server.
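Both patches are Java code that plugs into Hadoop, but the map-side versus reduce-side distinction is easy to see in miniature. The following toy Python sketch is not from either patch; it simply imitates, in a single process, the reduce-side flow just described, using made-up records and field names. The map step emits <key, value> pairs, the shuffle groups them by key, and the reduce step is where complete documents (and, in Solr-1301, the actual indexes) get built:

from itertools import groupby

# "Map" phase: turn each input record into <key, value> pairs,
# here (doc_id, (field, text)) tuples.
records = [
    ("doc1", {"title": "Hadoop basics", "body": "MapReduce explained"}),
    ("doc2", {"title": "Solr indexing", "body": "Sharding and replication"}),
]
pairs = [(doc_id, (field, text))
         for doc_id, fields in records
         for field, text in fields.items()]

# "Shuffle": Hadoop sorts and groups the pairs by key between map and reduce.
pairs.sort(key=lambda kv: kv[0])

# "Reduce" phase: rebuild one document per key; in the reduce-side approach,
# this is where SolrInputDocument creation and index writing would happen.
for doc_id, group in groupby(pairs, key=lambda kv: kv[0]):
    document = dict(field_text for _, field_text in group)
    print(doc_id, document)

In the map-side (Solr-1045) approach, by contrast, the per-document work would already happen before the shuffle, which is why any ranking or weight information attached to individual pairs is lost by the time the reducer sees them.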
The patch contains a default converter for CSV files. Let's look at some of the important classes which are part of this patch:

- CSVDocumentConverter: This class is responsible for converting the output of the map task, that is, key-value pairs, to SolrInputDocument; you can have multiple document converters.
- CSVReducer: This is a reducer code implemented for Hadoop reducers.
- CSVIndexer: This is the main class to be called from your command line for creating indexes using MapReduce. You need to provide the input path for your data and the output path for storing shards.
- SolrDocumentConverter: This class is used in your map task for converting your objects into Solr documents.
- SolrRecordWriter: This class is an extension of mapreduce.RecordWriter; it breaks the data into multiple (key, value) pairs, which are then converted into a collection of SolrInputDocument(s), and this data is then submitted to SolrEmbeddedServer in batches. Once completed, it will commit the changes and run the optimizer on the embedded server.
- CSVMapper: This class parses CSV files and gets key-value pairs out of them. This is a mapper class.
- SolrOutputFormat: This class is responsible for converting key-value pairs and writing the data to a file/HDFS in zip/raw format.

Perform the following steps to run this patch:

1. Create a local folder with a configuration folder and a library folder: conf containing the Solr configuration (solr-config.xml, schema.xml) and lib containing the libraries.
2. Create your own converter class implementing SolrDocumentConverter; this will be used by SolrOutputFormat to convert output records to Solr documents. You may also override the OutputFormat class provided by Solr.
3. Write the Hadoop MapReduce job configuration:

SolrOutputFormat.setupSolrHomeCache(new File(solrConfigDir), conf);
conf.setOutputFormat(SolrOutputFormat.class);
SolrDocumentConverter.setSolrDocumentConverter(<yourclassname>.class, conf);

4. Zip your configuration and load it into HDFS. The ZIP file name should be solr.zip (unless you change the patch code).
5. Now run the patch. Each of the jobs will instantiate EmbeddedSolrInstance, which will in turn do the conversion, and finally the SolrOutputDocument(s) get stored in the output format.

Benefits and drawbacks

The following are the benefits and drawbacks of using the Solr-1301 patch:

Benefits

- With reduce-side index generation, it is possible to preserve the weights of documents, which can contribute to prioritization during a search query.

Drawbacks

- Merging of indexes is not possible, as it is in Solr-1045, because the indexes are created in the reduce phase.
- The reducer becomes the crucial component of the system due to the major tasks being performed there.

Summary

In this article, we have understood the different possible approaches for making Big Data work with Apache Hadoop and Solr. We also looked at the benefits and drawbacks of these approaches.

Resources for Article:

Further resources on this subject:
- Advanced Hadoop MapReduce Administration [Article]
- Apache Solr Configuration [Article]
- Analytics – Drawing a Frequency Distribution with MapReduce (Intermediate) [Article]


Learning How to Classify with Real-world Examples

Packt
22 Aug 2013
24 min read
(For more resources related to this topic, see here.)

The Iris dataset

The Iris dataset is a classic dataset from the 1930s; it is one of the first modern examples of statistical classification. The setting is that of Iris flowers, of which there are multiple species that can be identified by their morphology. Today, the species would be defined by their genomic signatures, but in the 1930s, DNA had not even been identified as the carrier of genetic information. The following four attributes of each plant were measured:

- Sepal length
- Sepal width
- Petal length
- Petal width

In general, we will call any measurement from our data a feature. Additionally, for each plant, the species was recorded. The question now is: if we saw a new flower out in the field, could we make a good prediction about its species from its measurements?

This is the supervised learning or classification problem; given labeled examples, we can design a rule that will eventually be applied to other examples. This is the same setting that is used for spam classification: given the examples of spam and ham (non-spam e-mail) that the user gave the system, can we determine whether a new, incoming message is spam or not? For the moment, the Iris dataset serves our purposes well. It is small (150 examples, 4 features each) and can easily be visualized and manipulated.

The first step is visualization

Because this dataset is so small, we can easily plot all of the points and all two-dimensional projections on a page. We will thus build intuitions that can then be extended to datasets with many more dimensions and datapoints. Each subplot in the following screenshot shows all the points projected into two of the dimensions. The outlying group (triangles) are the Iris Setosa plants, while the Iris Versicolor plants are in the center (circles) and Iris Virginica are indicated with "x" marks. We can see that there are two large groups: one is of Iris Setosa and another is a mixture of Iris Versicolor and Iris Virginica.

We are using Matplotlib; it is the most well-known plotting package for Python. We present the code to generate the top-left plot. The code for the other plots is similar to the following code:

from matplotlib import pyplot as plt
from sklearn.datasets import load_iris
import numpy as np

# We load the data with load_iris from sklearn
data = load_iris()
features = data['data']
feature_names = data['feature_names']
target = data['target']
# String labels such as 'setosa' (used below); load_iris stores the numeric
# targets and the corresponding species names separately.
labels = data['target_names'][target]

for t, marker, c in zip(xrange(3), ">ox", "rgb"):
    # We plot each class on its own to get different colored markers
    plt.scatter(features[target == t, 0],
                features[target == t, 1],
                marker=marker,
                c=c)

Building our first classification model

If the goal is to separate the three types of flowers, we can immediately make a few suggestions. For example, the petal length seems to be able to separate Iris Setosa from the other two flower species on its own. We can write a little bit of code to discover where the cutoff is, as follows:

plength = features[:, 2]
# use numpy operations to get setosa features
is_setosa = (labels == 'setosa')
# This is the important step:
max_setosa = plength[is_setosa].max()
min_non_setosa = plength[~is_setosa].min()
print('Maximum of setosa: {0}.'.format(max_setosa))
print('Minimum of others: {0}.'.format(min_non_setosa))

This prints 1.9 and 3.0. Therefore, we can build a simple model: if the petal length is smaller than two, this is an Iris Setosa flower; otherwise, it is either Iris Virginica or Iris Versicolor.
if features[:, 2] < 2:
    print 'Iris Setosa'
else:
    print 'Iris Virginica or Iris Versicolour'

This is our first model, and it works very well in that it separates the Iris Setosa flowers from the other two species without making any mistakes. What we had here was a simple structure: a simple threshold on one of the dimensions. Then we searched for the best dimension threshold. We performed this visually and with some calculation; machine learning happens when we write code to perform this for us.

The example where we distinguished Iris Setosa from the other two species was very easy. However, we cannot immediately see what the best threshold is for distinguishing Iris Virginica from Iris Versicolor. We can even see that we will never achieve perfect separation. We can, however, try to do it the best possible way. For this, we will perform a little computation. We first select only the non-Setosa features and labels:

features = features[~is_setosa]
labels = labels[~is_setosa]
virginica = (labels == 'virginica')

Here we are heavily using NumPy operations on the arrays. is_setosa is a Boolean array, and we use it to select a subset of the other two arrays, features and labels. Finally, we build a new Boolean array, virginica, using an equality comparison on labels.

Now, we run a loop over all possible features and thresholds to see which one results in better accuracy. Accuracy is simply the fraction of examples that the model classifies correctly:

best_acc = -1.0
for fi in xrange(features.shape[1]):
    # We are going to generate all possible thresholds for this feature
    thresh = features[:, fi].copy()
    thresh.sort()
    # Now test all thresholds:
    for t in thresh:
        pred = (features[:, fi] > t)
        acc = (pred == virginica).mean()
        if acc > best_acc:
            best_acc = acc
            best_fi = fi
            best_t = t

The last few lines select the best model. First, we compare the predictions, pred, with the actual labels, virginica. The little trick of computing the mean of the comparisons gives us the fraction of correct results, the accuracy. At the end of the for loop, all possible thresholds for all possible features have been tested, and the best_fi and best_t variables hold our model. To apply it to a new example, we perform the following:

if example[best_fi] > best_t:
    print 'virginica'
else:
    print 'versicolor'

What does this model look like? If we run it on the whole data, the best model that we get is a split on the petal length. We can visualize the decision boundary. In the following screenshot, we see two regions: one is white and the other is shaded in grey. Anything that falls in the white region will be called Iris Virginica and anything that falls on the shaded side will be classified as Iris Versicolor.

In a threshold model, the decision boundary will always be a line that is parallel to one of the axes. The plot in the preceding screenshot shows the decision boundary and the two regions where the points are classified as either white or grey. It also shows (as a dashed line) an alternative threshold that will achieve exactly the same accuracy. Our method chose the first threshold, but that was an arbitrary choice.

Evaluation – holding out data and cross-validation

The model discussed in the preceding section is a simple model; it achieves 94 percent accuracy on its training data. However, this evaluation may be overly optimistic. We used the data to define what the threshold would be, and then we used the same data to evaluate the model. Of course, the model will perform better than anything else we have tried on this dataset.
The logic is circular. What we really want to do is estimate the ability of the model to generalize to new instances. We should measure its performance on instances that the algorithm has not seen at training time. Therefore, we are going to do a more rigorous evaluation and use held-out data. For this, we are going to break up the data into two blocks: on one block, we'll train the model, and on the other (the one we held out of training), we'll test it. The output is as follows:

Training error was 96.0%.
Testing error was 90.0% (N = 50).

The result on the testing data is lower than that on the training data. This may surprise an inexperienced machine learner, but it is expected and typical. To see why, look back at the plot that showed the decision boundary. Consider what would happen if some of the examples close to the boundary were not there, or if one of the ones in between the two lines was missing. It is easy to imagine that the boundary would then move a little bit to the right or to the left so as to put them on the "wrong" side of the border.

The error on the training data is called the training error and is always an overly optimistic estimate of how well your algorithm is doing. We should always measure and report the testing error: the error on a collection of examples that were not used for training. These concepts will become more and more important as the models become more complex. In this example, the difference between the two errors is not very large. When using a complex model, it is possible to get 100 percent accuracy in training and do no better than random guessing on testing!

One possible problem with what we did previously, which was to hold out data from training, is that we only used part of the data (in this case, we used half of it) for training. On the other hand, if we use too little data for testing, the error estimation is performed on a very small number of examples. Ideally, we would like to use all of the data for training and all of the data for testing as well. We can achieve something quite similar with cross-validation. One extreme (but sometimes useful) form of cross-validation is leave-one-out. We will take an example out of the training data, learn a model without this example, and then see if the model classifies this example correctly:

error = 0.0
for ei in range(len(features)):
    # select all but the one at position 'ei':
    training = np.ones(len(features), bool)
    training[ei] = False
    testing = ~training
    model = learn_model(features[training], virginica[training])
    predictions = apply_model(features[testing], virginica[testing], model)
    error += np.sum(predictions != virginica[testing])

At the end of this loop, we will have tested a series of models on all the examples. However, there is no circularity problem, because each example was tested on a model that was built without taking that example into account. Therefore, the overall estimate is a reliable estimate of how well the models would generalize.

The major problem with leave-one-out cross-validation is that we are now being forced to perform 100 times more work. In fact, we must learn a whole new model for each and every example, and this will grow as our dataset grows. We can get most of the benefits of leave-one-out at a fraction of the cost by using x-fold cross-validation; here, "x" stands for a small number, say, five. In order to perform five-fold cross-validation, we break up the data into five groups, that is, five folds. Then we learn five models, leaving one fold out of each.
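The learn_model and apply_model helpers used in the loop above are not shown in this excerpt. As a rough guide only, here is a minimal sketch of what they could look like, based on the threshold search from the previous section, together with the five-fold loop just described. The signatures are chosen to match the call sites above and are assumptions, not the book's actual implementation; note that apply_model accepts the test labels only because the quoted loop passes them in.

import numpy as np

def learn_model(features, labels):
    # Exhaustive search over every feature and threshold, as in the earlier loop.
    best_acc = -1.0
    for fi in range(features.shape[1]):
        for t in features[:, fi]:
            pred = (features[:, fi] > t)
            acc = (pred == labels).mean()
            if acc > best_acc:
                best_acc, best_fi, best_t = acc, fi, t
    return (best_fi, best_t)

def apply_model(features, labels, model):
    # 'labels' is unused for prediction; it is only accepted to match the call sites.
    best_fi, best_t = model
    return features[:, best_fi] > best_t

def cross_validate(features, labels, folds=5):
    # Five-fold version of the leave-one-out loop: every example lands in the
    # test fold exactly once, and the misclassifications are summed over folds.
    # (In practice you would shuffle or stratify first so the folds stay
    # balanced, as discussed below.)
    error = 0.0
    fold_ids = np.arange(len(features)) % folds
    for fold in range(folds):
        training = (fold_ids != fold)
        testing = ~training
        model = learn_model(features[training], labels[training])
        predictions = apply_model(features[testing], labels[testing], model)
        error += np.sum(predictions != labels[testing])
    return error / float(len(features))

Calling cross_validate(features, virginica) then returns an estimated error rate rather than a raw count of mistakes.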
The resulting code will be similar to the code given earlier in this section, but here we leave 20 percent of the data out instead of just one element. We test each of these models on the left-out fold and average the results:

The preceding figure illustrates this process for five blocks: the dataset is split into five pieces. Then, for each fold, you hold out one of the blocks for testing and train on the other four. You can use any number of folds you wish. Five or ten folds are typical; this corresponds to training with 80 or 90 percent of your data and should already be close to what you would get from using all the data. In an extreme case, if you have as many folds as datapoints, you can simply perform leave-one-out cross-validation.

When generating the folds, you need to be careful to keep them balanced. For example, if all of the examples in one fold come from the same class, the results will not be representative. We will not go into the details of how to do this, because the machine learning packages will handle it for you.

We have now generated several models instead of just one. So, what final model do we return and use for the new data? The simplest solution is to now train a single overall model on all your training data. The cross-validation loop gives you an estimate of how well this model should generalize. A cross-validation schedule allows you to use all your data to estimate whether your methods are doing well. At the end of the cross-validation loop, you can use all your data to train a final model.

Although it was not properly recognized when machine learning was starting out, nowadays it is seen as a very bad sign to even discuss the training error of a classification system. This is because the results can be very misleading. We always want to measure and compare either the error on a held-out dataset or the error estimated using a cross-validation schedule.

Building more complex classifiers

In the previous section, we used a very simple model: a threshold on one of the dimensions. Throughout this article, you will see many other types of models, and we're not even going to cover everything that is out there. What makes up a classification model? We can break it up into three parts:

- The structure of the model: In this, we use a threshold on a single feature.
- The search procedure: In this, we try every possible combination of feature and threshold.
- The loss function: Using the loss function, we decide which of the possibilities is less bad (because we can rarely talk about the perfect solution). We can use the training error or just define this point the other way around and say that we want the best accuracy. Traditionally, people want the loss function to be at a minimum.

We can play around with these parts to get different results. For example, we can attempt to build a threshold that achieves minimal training error, but we will only test three values for each feature: the mean value of the features, the mean plus one standard deviation, and the mean minus one standard deviation. This could make sense if testing each value was very costly in terms of computer time (or if we had millions and millions of datapoints). Then the exhaustive search we used would be infeasible, and we would have to perform an approximation like this.

Alternatively, we might have different loss functions. It might be that one type of error is much more costly than another. In a medical setting, false negatives and false positives are not equivalent.
A false negative (when the result of a test comes back negative, but that is false) might lead to the patient not receiving treatment for a serious disease. A false positive (when the test comes back positive even though the patient does not actually have that disease) might lead to additional tests for confirmation purposes or unnecessary treatment (which can still have costs, including side effects from the treatment). Therefore, depending on the exact setting, different trade-offs can make sense. At one extreme, if the disease is fatal and treatment is cheap with very few negative side effects, you want to minimize the false negatives as much as you can. With spam filtering, we may face the same problem; incorrectly deleting a non-spam e-mail can be very dangerous for the user, while letting a spam e-mail through is just a minor annoyance. What the cost function should be is always dependent on the exact problem you are working on. When we present a general-purpose algorithm, we often focus on minimizing the number of mistakes (achieving the highest accuracy). However, if some mistakes are more costly than others, it might be better to accept a lower overall accuracy to minimize overall costs. Finally, we can also have other classification structures. A simple threshold rule is very limiting and will only work in the very simplest cases, such as with the Iris dataset. A more complex dataset and a more complex classifier We will now look at a slightly more complex dataset. This will motivate the introduction of a new classification algorithm and a few other ideas. Learning about the Seeds dataset We will now look at another agricultural dataset; it is still small, but now too big to comfortably plot exhaustively as we did with Iris. This is a dataset of the measurements of wheat seeds. Seven features are present, as follows:
Area (A)
Perimeter (P)
Compactness (C = 4πA/P²)
Length of kernel
Width of kernel
Asymmetry coefficient
Length of kernel groove
There are three classes that correspond to three wheat varieties: Canadian, Kama, and Rosa. As before, the goal is to be able to classify the species based on these morphological measurements. Unlike the Iris dataset, which was collected in the 1930s, this is a very recent dataset, and its features were automatically computed from digital images. This is how image pattern recognition can be implemented: you can take images in digital form, compute a few relevant features from them, and use a generic classification system. Later, we will work through the computer vision side of this problem and compute features in images. For the moment, we will work with the features that are given to us. UCI Machine Learning Dataset Repository The University of California at Irvine (UCI) maintains an online repository of machine learning datasets (at the time of writing, they are listing 233 datasets). Both the Iris and Seeds datasets used in this article were taken from there. The repository is available online: http://archive.ics.uci.edu/ml/ Features and feature engineering One interesting aspect of these features is that the compactness feature is not actually a new measurement, but a function of the previous two features, area and perimeter. It is often very useful to derive new combined features. This is a general area normally termed feature engineering; it is sometimes seen as less glamorous than algorithms, but it may matter more for performance (a simple algorithm on well-chosen features will perform better than a fancy algorithm on not-so-good features).
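As a small illustration of such a derived feature, the compactness column can be recomputed directly from area and perimeter. The column positions used below are an assumption about how the raw data might be laid out, not something fixed by the text:

import numpy as np

def add_compactness(raw):
    # raw is assumed to hold the area in column 0 and the perimeter in column 1
    area, perimeter = raw[:, 0], raw[:, 1]
    compactness = 4 * np.pi * area / perimeter ** 2   # close to 1 for round kernels
    return np.column_stack([raw, compactness])

sample = np.array([[15.26, 14.84],
                   [14.88, 14.57]])
print(add_compactness(sample))

A kernel that is simply scaled up keeps the same compactness value, which is exactly the invariance-to-size property discussed next.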
In this case, the original researchers computed the "compactness", which is a typical feature for shapes (also called "roundness"). This feature will have the same value for two kernels, one of which is twice as big as the other one, but with the same shape. However, it will have different values for kernels that are very round (when the feature is close to one) as compared to kernels that are elongated (when the feature is close to zero). The goals of a good feature are to simultaneously vary with what matters and be invariant with what does not. For example, compactness does not vary with size but varies with the shape. In practice, it might be hard to achieve both objectives perfectly, but we want to approximate this ideal. You will need to use background knowledge to intuit which will be good features. Fortunately, for many problem domains, there is already a vast literature of possible features and feature types that you can build upon. For images, all of the previously mentioned features are typical, and computer vision libraries will compute them for you. In text-based problems too, there are standard solutions that you can mix and match. Often though, you can use your knowledge of the specific problem to design a specific feature. Even before you have data, you must decide which data is worthwhile to collect. Then, you need to hand all your features to the machine to evaluate and compute the best classifier. A natural question is whether or not we can select good features automatically. This problem is known as feature selection. There are many methods that have been proposed for this problem, but in practice, very simple ideas work best. It does not make sense to use feature selection in these small problems, but if you had thousands of features, throwing out most of them might make the rest of the process much faster. Nearest neighbor classification With this dataset, even if we just try to separate two classes using the previous method, we do not get very good results. Let me introduce, therefore, a new classifier: the nearest neighbor classifier. If we consider that each example is represented by its features (in mathematical terms, as a point in N-dimensional space), we can compute the distance between examples. We can choose different ways of computing the distance, for example: def distance(p0, p1): 'Computes squared euclidean distance' return np.sum( (p0-p1)**2) Now when classifying, we adopt a simple rule: given a new example, we look at the dataset for the point that is closest to it (its nearest neighbor) and look at its label: def nn_classify(training_set, training_labels, new_example): dists = np.array([distance(t, new_example) for t in training_set]) nearest = dists.argmin() return training_labels[nearest] In this case, our model involves saving all of the training data and labels and computing everything at classification time. A better implementation would be to actually index these at learning time to speed up classification, but this implementation is a complex algorithm. Now, note that this model performs perfectly on its training data! For each point, its closest neighbor is itself, and so its label matches perfectly (unless two examples have exactly the same features but different labels, which can happen). Therefore, it is essential to test using a cross-validation protocol. Using ten folds for cross-validation for this dataset with this algorithm, we obtain 88 percent accuracy. 
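A possible way to wire the nn_classify function into such a fold-based evaluation is sketched here; it reuses distance and nn_classify as defined above, while features and labels stand for the Seeds measurements and their variety labels (how they are loaded is not shown in this excerpt):

import numpy as np

def nn_accuracy(features, labels, folds=10):
    correct = 0
    for fold in range(folds):
        testing = np.zeros(len(features), bool)
        testing[fold::folds] = True
        training = ~testing
        for example, label in zip(features[testing], labels[testing]):
            prediction = nn_classify(features[training], labels[training], example)
            correct += (prediction == label)
    return correct / float(len(features))

print('Ten-fold accuracy: {0:.1%}'.format(nn_accuracy(features, labels)))

Note that the distances here are computed on the raw features; the scaling problem this causes is exactly what the following paragraphs address.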
As we discussed in the earlier section, the cross-validation accuracy is lower than the training accuracy, but this is a more credible estimate of the performance of the model. We will now examine the decision boundary. For this, we will be forced to simplify and look at only two dimensions (just so that we can plot it on paper). In the preceding screenshot, the Canadian examples are shown as diamonds, Kama seeds as circles, and Rosa seeds as triangles. Their respective areas are shown as white, black, and grey. You might be wondering why the regions are so horizontal, almost weirdly so. The problem is that the x axis (area) ranges from 10 to 22 while the y axis (compactness) ranges from 0.75 to 1.0. This means that a small change in x is actually much larger than a small change in y. So, when we compute the distance according to the preceding function, we are, for the most part, only taking the x axis into account. If you have a physics background, you might have already noticed that we had been summing up lengths, areas, and dimensionless quantities, mixing up our units (which is something you never want to do in a physical system). We need to normalize all of the features to a common scale. There are many solutions to this problem; a simple one is to normalize to Z-scores. The Z-score of a value is how far away from the mean it is in terms of units of standard deviation. It comes down to this simple pair of operations: # subtract the mean for each feature: features -= features.mean(axis=0) # divide each feature by its standard deviation features /= features.std(axis=0) Independent of what the original values were, after Z-scoring, a value of zero is the mean and positive values are above the mean and negative values are below it. Now every feature is in the same unit (technically, every feature is now dimensionless; it has no units) and we can mix dimensions more confidently. In fact, if we now run our nearest neighbor classifier, we obtain 94 percent accuracy! Look at the decision space again in two dimensions; it looks as shown in the following screenshot: The boundaries are now much more complex and there is interaction between the two dimensions. In the full dataset, everything is happening in a seven-dimensional space that is very hard to visualize, but the same principle applies: where before a few dimensions were dominant, now they are all given the same importance. The nearest neighbor classifier is simple, but sometimes good enough. We can generalize it to a k-nearest neighbor classifier by considering not just the closest point but the k closest points. All k neighbors vote to select the label. k is typically a small number, such as 5, but can be larger, particularly if the dataset is very large. Binary and multiclass classification The first classifier we saw, the threshold classifier, was a simple binary classifier (the result is either one class or the other as a point is either above the threshold or it is not). The second classifier we used, the nearest neighbor classifier, was a naturally multiclass classifier (the output can be one of several classes). It is often simpler to define a simple binary method than one that works on multiclass problems. However, we can reduce the multiclass problem to a series of binary decisions. This is what we did earlier in the Iris dataset in a haphazard way; we observed that it was easy to separate one of the initial classes and focused on the other two, reducing the problem to two binary decisions: Is it an Iris Setosa (yes or no)? 
If no, check whether it is an Iris Virginica (yes or no). Of course, we want to leave this sort of reasoning to the computer. As usual, there are several solutions to this multiclass reduction. The simplest is to use a series of "one classifier versus the rest of the classifiers". For each possible label l, we build a classifier of the type "is this l or something else?". When applying the rule, exactly one of the classifiers would say "yes" and we would have our solution. Unfortunately, this does not always happen, so we have to decide how to deal with either multiple positive answers or no positive answers. Alternatively, we can build a classification tree. Split the possible labels in two and build a classifier that asks "should this example go to the left or the right bin?" We can perform this splitting recursively until we obtain a single label. The preceding diagram depicts the tree of reasoning for the Iris dataset. Each diamond is a single binary classifier. It is easy to imagine we could make this tree larger and encompass more decisions. This means that any classifier that can be used for binary classification can also be adapted to handle any number of classes in a simple way. There are many other possible ways of turning a binary method into a multiclass one. There is no single method that is clearly better in all cases. However, which one you use normally does not make much of a difference to the final result. Most classifiers are binary systems while many real-life problems are naturally multiclass. Several simple protocols reduce a multiclass problem to a series of binary decisions and allow us to apply the binary models to our multiclass problem. Summary In a sense, this was a very theoretical article, as we introduced generic concepts with simple examples. We went over a few operations with a classic dataset. This, by now, is considered a very small problem. However, it has the advantage that we were able to plot it out and see what we were doing in detail. This is something that will be lost when we move on to problems with many dimensions and many thousands of examples. The intuitions we gained here will all still be valid. Classification means generalizing from examples to build a model (that is, a rule that can automatically be applied to new, unclassified objects). It is one of the fundamental tools in machine learning. We also learned that the training error is a misleading, over-optimistic estimate of how well the model does. We must, instead, evaluate it on testing data that was not used for training. In order to not waste too many examples in testing, a cross-validation schedule can get us the best of both worlds (at the cost of more computation). We also had a look at the problem of feature engineering. Features are not something that is predefined for you, but choosing and designing features is an integral part of designing a machine-learning pipeline. In fact, it is often the area where you can get the most improvements in accuracy as better data beats fancier methods. In this article, we wrote all of our own code (except when we used NumPy, of course). We needed to build up intuitions on simple cases to illustrate the basic concepts. Resources for Article: Further resources on this subject: Python Testing: Installing the Robot Framework [Article] Getting Started with Spring Python [Article] Creating Skeleton Apps with Coily in Spring Python [Article]


What is Hazelcast?

Packt
22 Aug 2013
10 min read
(For more resources related to this topic, see here.) Starting out as usual In most modern software systems, data is the key. For more traditional architectures, the role of persisting and providing access to your system's data tends to fall to a relational database. Typically this is a monolithic beast, perhaps with a degree of replication, although this tends to be more for resilience rather than performance. For example, here is what a traditional architecture might look like (which hopefully looks rather familiar) This presents us with an issue in terms of application scalability, in that it is relatively easy to scale our application layer by throwing more hardware at it to increase the processing capacity. But the monolithic constraints of our data layer would only allow us to do this so far before diminishing returns or resource saturation stunted further performance increases; so what can we do to address this? In the past and in legacy architectures, the only solution would be to increase the performance capability of our database infrastructure, potentially by buying a bigger, faster server or by further tweaking and fettling the utilization of currently available resources. Both options are dramatic, either in terms of financial cost and/or manpower; so what else could we do? Data deciding to hang around In order for us to gain a bit more performance out of our existing setup, we can hold copies of our data away from the primary database and use these in preference wherever possible. There are a number of different strategies we could adopt, from transparent second-level caching layers to external key-value object storage. The detail and exact use of each varies significantly depending on the technology or its place in the architecture, but the main desire of these systems is to sit alongside the primary database infrastructure and attempt to protect it from an excessive load. This would then tend to lead to an increased performance of the primary database by reducing the overall dependency on it. However, this strategy tends to be only particularly valuable as a short-term solution, effectively buying us a little more time before the database once again starts to reach saturation. The other downside is that it only protects our database from read-based load; if our application is predominately write-heavy, this strategy has very little to offer. So our expanded architecture could look a bit like the following figure: Therein lies the problem However, in insulating the database from the read load, we have introduced a problem in the form of a cache consistency issue, in that, how does our local data cache deal with changing data underneath it within the primary database? The answer is rather depressing: it can't! The exact manifestation of any issues will largely depend on the data needs of the application and how frequently the data changes; but typically, caching systems will operate in one of the two following modes to combat the problem: Time bound cache: Holds entries for a defined period (time-to-live or TTL) Write through cache: Holds entries until they are invalidated by subsequent updates Time bound caches almost always have consistency issues, but at least the amount of time that the issue would be present is limited to the expiry time of each entry. However, we must consider the application's access to this data, because if the frequency of accessing a particular entry is less than the cache expiry time of it, the cache is providing no real benefit. 
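To make the trade-off concrete, here is a minimal, hypothetical sketch of a time-bound cache. It is written in Python purely for illustration (Hazelcast itself is a Java library), and the loader callback stands in for a query against the primary database:

import time

class TTLCache:
    def __init__(self, ttl_seconds, loader):
        self.ttl = ttl_seconds
        self.loader = loader      # falls back to the primary database
        self.store = {}           # key -> (value, expiry timestamp)

    def get(self, key):
        value, expires = self.store.get(key, (None, 0))
        if time.time() < expires:
            return value          # served from the cache, possibly stale
        value = self.loader(key)  # expired or missing: go back to the database
        self.store[key] = (value, time.time() + self.ttl)
        return value

cache = TTLCache(ttl_seconds=30, loader=lambda key: "row-for-" + key)
print(cache.get("user:1"))   # first call loads from the "database"
print(cache.get("user:1"))   # a repeat call within 30 seconds is served from memory

Any change made to the underlying row within those 30 seconds is invisible to readers of this cache, which is precisely the consistency issue described above.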
Write through caches are consistent in isolation and can be configured to offer strict consistency, but if multiple write through caches exist within the overall architecture, then there will be consistency issues between them. We can avoid this by having a more intelligent cache that features a communication mechanism between the nodes, so that entry invalidations can be propagated to the other caches. In practice, an ideal cache would feature a combination of both features, so that entries would be held for a known maximum time, but invalidations would also be passed around as changes are made. So our evolved architecture would look a bit like the following figure: So far we've had a look through the general issues in scaling our data layer, and introduced strategies to help combat the trade-offs we will encounter along the way; however, the real world isn't quite as simple. There are various cache servers and in-memory database products in this area; however, most of these are stand-alone single instances, perhaps with some degree of distribution bolted on or provided by other supporting technologies. This tends to bring about the same issues we experienced with just our primary database, in that we could encounter resource saturation or capacity issues if the product is a single instance, or, if the distribution doesn't provide consistency control, perhaps inconsistent data, which might harm our application. Breaking the mould Hazelcast is a radical new approach to data, designed from the ground up around distribution. It embraces a new scalable way of thinking, in that data should be shared around for both resilience and performance, while allowing us to configure the trade-offs surrounding consistency as the data requirements dictate. The first major feature to understand about Hazelcast is its masterless nature; each node is configured to be functionally the same. The oldest node in the cluster is the de facto leader and manages the membership, automatically delegating which node is responsible for what data. In this way, as new nodes join or drop out, the process is repeated and the cluster rebalances accordingly. This makes Hazelcast incredibly simple to get up and running, as the system is self-discovering, self-clustering, and works straight out of the box. However, the second feature to remember is that we are persisting data entirely in-memory; this makes it incredibly fast, but this speed comes at a price. When a node is shut down, all the data that was held on it is lost. We combat this risk to resilience through replication, by holding enough copies of a piece of data across multiple nodes. In the event of failure, the overall cluster will not suffer any data loss. By default, the standard backup count is 1, so we can immediately enjoy basic resilience. But don't pull the plug on more than one node at a time, until the cluster has reacted to the change in membership and reestablished the appropriate number of backup copies of data. So when we introduce our new masterless distributed cluster, we get something like the following figure: We previously identified that multi-node caches tend to suffer from either saturation or consistency issues. In the case of Hazelcast, each node is the owner of a number of partitions of the overall data, so the load will be fairly spread across the cluster. Hence, any saturation would be at the cluster level rather than at any individual node. We can address this issue simply by adding more nodes.
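The partition idea can be sketched in a few lines of illustrative Python. This is a conceptual model only, not Hazelcast's actual implementation; 271 appears because it is Hazelcast's default partition count:

PARTITION_COUNT = 271

def partition_for(key):
    # a stand-in for Hazelcast's own hashing of the serialized key
    return hash(key) % PARTITION_COUNT

def owner_of(partition_id, nodes):
    # simple round-robin assignment stands in for the membership-driven logic
    return nodes[partition_id % len(nodes)]

nodes = ["node1", "node2", "node3"]
pid = partition_for("customer:42")
print(pid, owner_of(pid, nodes))              # the node our request is routed to
print(pid, owner_of(pid, nodes + ["node4"]))  # ownership can move when a node joins

The important point is that when the membership changes it is whole partitions that move between nodes, never individual entries, which keeps rebalancing relatively cheap.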
In terms of consistency, by default the backup copies of the data are internal to Hazelcast and not directly used; as such, we enjoy strict consistency. This does mean that we have to interact with a specific node to retrieve or update a particular piece of data; however, exactly which node that is remains an internal operational detail and can vary over time; we as developers never actually need to know. If we imagine that our data is split into a number of partitions, and that each partition slice is owned by one node and backed up on another, we could then visualize the interactions like the following figure: This means that for data belonging to Partition 1, our application will have to communicate with Node 1, with Node 2 for data belonging to Partition 2, and so on. The slicing of the data into each partition is dynamic; so in practice, where there are more partitions than nodes, each node will own a number of different partitions and hold backups for others. As we have mentioned before, all of this is an internal operational detail, and our application does not need to know it, but it is important that we understand what is going on behind the scenes. Moving to new ground So far we have been talking mostly about simple persisted data and caches, but in reality, we should not think of Hazelcast as purely a cache, as it is much more powerful than just that. It is an in-memory data grid that supports a number of distributed collections and features. We can load in data from various sources into differing structures, send messages across the cluster, take out locks to guard against concurrent activity, and listen to the goings on inside the workings of the cluster. Most of these implementations correspond to a standard Java collection, or function in a manner comparable to other similar technologies, but all with the distribution and resilience capabilities already built in.
Standard utility collections:
Map: Key-value pairs
List: Collection of objects
Set: Non-duplicated collection
Queue: Offer/poll FIFO collection
Specialized collections:
Multi-Map: Key-list of values collection
Lock: Cluster-wide mutex
Topic: Publish/subscribe messaging
Concurrency utilities:
AtomicNumber: Cluster-wide atomic counter
IdGenerator: Cluster-wide unique identifier generation
Semaphore: Concurrency limitation
CountdownLatch: Concurrent activity gate-keeping
Listeners: Application notifications as things happen
In addition to data storage collections, Hazelcast also features a distributed executor service allowing runnable tasks to be created that can be run anywhere on the cluster to obtain, manipulate, and store results. We could have a number of collections containing source data, then spin up a number of tasks to process the disparate data (for example, averaging or aggregating) and output the results into another collection for consumption. Again, just as we could scale up our data capacities by adding more nodes, we can also increase the execution capacity in exactly the same way. This essentially means that by building our data layer around Hazelcast, if our application's needs rapidly increase, we can continuously increase the number of nodes to satisfy seemingly extensive demands, all without having to redesign or re-architect the actual application. With Hazelcast, we are dealing more with a technology than a server product, a library to build a system around rather than retrospectively bolting it on, or blindly connecting to an off-the-shelf commercial system.
While it is possible (and in some simple cases quite practical) to run Hazelcast as a separate server-like cluster and connect to it remotely from our application; some of the greatest benefits come when we develop our own classes and tasks run within it and alongside it. With such a large range of generic capabilities, there is an entire world of problems that Hazelcast can help solve. We can use the technology in many ways; in isolation to hold data such as user sessions, run it alongside a more long-term persistent data store to increase capacity, or shift towards performing high performance and scalable operations on our data. By moving more and more responsibility away from monolithic systems to such a generic scalable one, there is no limit to the performance we can unlock. This will allow us to keep our application and data layers separate, but enabling the ability to scale them up independently as our application grows. This will avoid our application becoming a victim of its own success, while hopefully taking the world by storm. Summary In this article, we learned about Hazelcast. With such a large range of generic capabilities, Hazelcast can solve a world of problems. Resources for Article: Further resources on this subject: JBoss AS Perspective [Article] Drools Integration Modules: Spring Framework and Apache Camel [Article] JBoss RichFaces 3.3 Supplemental Installation [Article]

Pentaho – Using Formulas in Our Reports

Packt
22 Aug 2013
7 min read
(For more resources related to this topic, see here.) At the end of the article, we propose that you make some modifications to the report created in this article. Starting practice In this article, we will create a copy of the report, then we will make the necessary changes to its layout; the final result is as follows: As we can observe in the previous screenshot, the rectangle that is to the left of each title changes color. We'll see how to do this, and much more, shortly. Time for action – making a copy of the previous report In this article, we will use an already created report. To do so, we will open it and save it with the name 09_Using_Formulas.prpt. Then we will modify its layout to fit this article. Finally, we will establish default values for our parameters. The steps for making a copy of the previous report are as follows: We open the report 07_Adding_Parameters.prpt that we created. Next, we create a copy by going to File | Save As... and saving it with the name 09_Using_Formulas.prpt. We will modify our report so that it looks like the following screenshot: As you can see, we have just added a rectangle in the Details section, a label (Total) in the Details Header section, and we have modified the name of the label found in the Report Header section. To easily differentiate this report from the one used previously, we have also modified its colors to grayscale. Later in this article, we will make the color of the rectangle vary according to a formula, so it is important that the rest of the report does not have too many colors, so that the results are easy for the end user to see. We will establish default values in our parameters so we can preview the report without delays caused by having to choose the values for ratings, year, and month. We go to the Data tab, select the SelectRating parameter, right-click on it, and choose the Edit Parameter... option: In Default Value, we type the value [G]: Next, we click on OK to continue. We should do something similar for SelectYear and SelectMonth: For SelectYear, the Default Value will be 2005. For SelectMonth, the Default Value will be 5. Remember that the selector shows the names of the months, but internally the months' numbers are used; so, 5 represents May. What just happened? We created a copy of the report 07_Adding_Parameters.prpt and saved it with the name 09_Using_Formulas.prpt. We changed the layout of the report, adding new objects and changing the colors. Then we established default values for the parameters SelectRating, SelectYear, and SelectMonth. Formulas To manage formulas, PRD implements the open standard OpenFormula. According to OpenFormula's specifications: "OpenFormula is an open format for exchanging recalculated formulas between office application implementations, particularly for spreadsheets. OpenFormula defines the types, syntax, and semantics for calculated formulas, including many predefined functions and operations, so that formulas can be exchanged between applications and produce substantively equal outputs when recalculated with equal inputs. Both closed and open source software can implement OpenFormula."
For more information on OpenFormula, refer to the following links: Wikipedia: http://en.wikipedia.org/wiki/OpenFormula Specifications: https://www.oasis-open.org/committees/download.php/16826/openformula-spec-20060221.html Web: http ://www.openformula.org/ Pentaho wiki: http://wiki.pentaho.com/display/Reporting/Formula+Expressions Formulas are used for greatly varied purposes, and their use depends on the result one wants to obtain. Formulas let us carry out simple and complex calculations based on fixed and variable values and include predefined functions that let us work with text, databases, date and time, let us make calculations, and also include general information functions and user-defined functions. They also use logical operators (AND, OR, and so on) and comparative operators (>, <, and so on). Creating formulas There are two ways to create formulas: By creating a new function and by going to Common | Open Formula By pressing the button in a section's / an object's Style or Attributes tab, or to configure some feature In the report we are creating in this article, we will create formulas using both methods. Using the first method, general-use formulas can be created. That is, the result will be an object that can either be included directly in our report or used as a value in another function, style, or attribute. We can create objects that make calculations at a general level to be included in sections that include Report Header, Group Footer, and so on, or we can make calculations to be included in the Details section. In this last case, the formula will make its calculation row by row. With this last example, we can make an important differentiation with respect to aggregate functions as they usually can only calculate totals and subtotals. Using the second method, we create specific-use functions that affect the value of the style or attribute of an individual object. The way to use these functions is simple. Just choose the value you want to modify in the Style and Attributes tabs and click on the button that appears on their right. In this way, you can create formulas that dynamically assign values to an object's color, position, width, length, format, visibility, and so on. Using this technique, stoplights can be created by assigning different values to an object according to a calculation, progress bars can be created by changing an object's length, and dynamic images can be placed in the report using the result of a formula to calculate the image's path. As we have seen in the examples, using formulas in our reports gives us great flexibility in applying styles and attributes to objects and to the report itself, as well as the possibility of creating our own objects based on complex calculations. By using formulas correctly, you will be able to give life to your reports and adapt them to changing contexts. For example, depending on which user executes the report, a certain image can appear in the Report Header section, or graphics and subreports can be hidden if the user does not have sufficient permissions. The formula editor The formula editor has a very intuitive and easy-to-use UI that in addition to guiding us in creating formulas, tells us, whenever possible, the value that the formula will return. In the following screenshot, you can see the formula editor: We will explain its layout with an example. Let's suppose that we added a new label and we want to create a formula that returns the value of Attributes.Value. 
For this purpose, we do the following: Select the option to the right of Attributes.Value. This will open the formula editor. In the upper-left corner, there is a selector where we can specify the category of functions that we want to see. Below this, we find a list of the functions that we can use to create our own formulas. In the lower-left section, we can see more information about the selected function; that is, the type of value that it will return and a general description: We choose the CONCATENATE function by double-clicking on it, and in the lower-right section, we can see the formula (Formula:) that we will use. We type in =CONCATENATE(Any), and an assistant will open in the upper-right section that will guide us in entering the values we want to concatenate. We could complete the CONCATENATE function by adding some fixed values and some variables; take the following example: If there is an error in the text of the formula, text will appear to warn us. Otherwise, the formula editor will try to show us the result that our formula will return. When it is not possible to visualize the result that a formula will return, this is usually because the values used are calculated during the execution of the report. Formulas should always begin with the = sign. Initially, one tends to use the help that the formula editor provides, but later, with more practice, it will become evident that it is much faster to type the formula directly. Also, if you need to enter complex formulas or add various functions with logical operators, the formula editor will not be of use.
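For instance, formulas of the following kind can be typed in directly; the field and parameter names used here ([Total], [SelectRating]) are only placeholders and must be replaced with names that actually exist in your own report:

=CONCATENATE("Rating: "; [SelectRating])
=IF([Total] > 1000; "green"; "red")

The first builds a label text from a parameter value, and the second is the sort of expression that can be attached to a style property, such as the fill color of the rectangle, to obtain the stoplight effect mentioned earlier.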


Lucene.NET: Optimizing and merging index segments

Packt
20 Aug 2013
3 min read
(For more resources related to this topic, see here.) How to do it… Index optimization is accomplished by calling the Optimize method on an instance of IndexWriter. The example for this recipe demonstrates the use of the Optimize method to clean up the storage of the index data on the physical disk. The general steps in the process to optimize an index and merge its segments are the following: Create/open an index. Add or delete documents from the index. Examine the MaxDoc and NumDocs properties of the IndexWriter class. If the index is deemed to be too dirty, call the Optimize method of the IndexWriter class. The following example for this recipe demonstrates taking these steps to create, modify, and then optimize an index.
namespace Lucene.NET.HowTo._12_MergeAndOptimize {
    // ...
    // build facade and an initial index of 5 documents
    var facade = new LuceneDotNetHowToExamplesFacade()
        .buildLexicographicalExampleIndex(maxDocs: 5)
        .createIndexWriter();

    // report MaxDoc and NumDocs
    Trace.WriteLine(string.Format("MaxDoc=={0}", facade.IndexWriter.MaxDoc()));
    Trace.WriteLine(string.Format("NumDocs=={0}", facade.IndexWriter.NumDocs()));

    // delete one document
    facade.IndexWriter.DeleteDocuments(new Term("filename", "0.txt"));
    facade.IndexWriter.Commit();

    // report MaxDoc and NumDocs
    Trace.WriteLine("After delete / commit");
    Trace.WriteLine(string.Format("MaxDoc=={0}", facade.IndexWriter.MaxDoc()));
    Trace.WriteLine(string.Format("NumDocs=={0}", facade.IndexWriter.NumDocs()));

    // optimize the index
    facade.IndexWriter.Optimize();

    // report MaxDoc and NumDocs
    Trace.WriteLine("After Optimize");
    Trace.WriteLine(string.Format("MaxDoc=={0}", facade.IndexWriter.MaxDoc()));
    Trace.WriteLine(string.Format("NumDocs=={0}", facade.IndexWriter.NumDocs()));

    Trace.Flush();
    // ...
}
How it works… When this program is run, you will see output similar to that in the following screenshot: This program first creates an index with five files. It then reports the values of the MaxDoc and NumDocs properties of the instance of IndexWriter. MaxDoc represents the maximum number of documents that have been stored in the index. It is possible to add more documents, but that may incur a performance penalty by needing to grow the index. NumDocs is the current number of documents stored in the index. At this point these values are 5 and 5, respectively. The next step deletes a single document named 0.txt from the index, and the changes are committed to disk. MaxDoc and NumDocs are written to the console again and now report 5 and 4 respectively. This makes sense as one file has been deleted and there is now "slop" in the index where space is being taken up from a previously deleted document. The reference to the document index information has been removed, but the space is still used on the disk. The final two steps are to call Optimize and to write MaxDoc and NumDocs values to the console, for the final time. These now are 4 and 4, respectively, as Lucene.NET has merged any index segments and removed any empty disk space formerly used by deleted document index information. Summary A Lucene.NET index physically contains one or more segments, each of which is its own index and holds a subset of the overall indexed content. As documents are added to the index, new segments are created as index writers flush buffered content into the index's directory and file structure. Over time this fragmentation will cause searches to slow, requiring a merge/optimization to be performed to regain performance.
Resources for Article : Further resources on this subject: Extending Your Structure and Search [Article] Advanced Performance Strategies [Article] Creating your first collection (Simple) [Article]


Analytics – Drawing a Frequency Distribution with MapReduce (Intermediate)

Packt
14 Aug 2013
6 min read
(For more resources related to this topic, see here.) Often, we use Hadoop to calculate analytics, which are basic statistics about data. In such cases, we walk through the data using Hadoop and calculate interesting statistics about the data. Some of the common analytics are shown as follows: Calculating statistical properties like minimum, maximum, mean, median, standard deviation, and so on of a dataset. For a dataset, generally there are multiple dimensions (for example, when processing HTTP access logs, names of the web page, the size of the web page, access time, and so on, are few of the dimensions). We can measure the previously mentioned properties by using one or more dimensions. For example, we can group the data into multiple groups and calculate the mean value in each case. Frequency distributions histogram counts the number of occurrences of each item in the dataset, sorts these frequencies, and plots different items as X axis and frequency as Y axis. Finding a correlation between two dimensions (for example, correlation between access count and the file size of web accesses). Hypothesis testing: To verify or disprove a hypothesis using a given dataset. However, Hadoop will only generate numbers. Although the numbers contain all the information, we humans are very bad at figuring out overall trends by just looking at numbers. On the other hand, the human eye is remarkably good at detecting patterns, and plotting the data often yields us a deeper understanding of the data. Therefore, we often plot the results of Hadoop jobs using some plotting program. Getting ready This article assumes that you have access to a computer that has Java installed and the JAVA_HOME variable configured. Download a Hadoop distribution 1.1.x from the http://hadoop.apache.org/releases.html page. Unzip the distribution; we will call this directory HADOOP_HOME. Download the sample code for the article and copy the data files. How to do it... If you have not already done so, let us upload the amazon dataset to the HDFS filesystem using the following commands:
>bin/hadoop dfs -mkdir /data/
>bin/hadoop dfs -mkdir /data/amazon-dataset
>bin/hadoop dfs -put <SAMPLE_DIR>/amazon-meta.txt /data/amazon-dataset/
>bin/hadoop dfs -ls /data/amazon-dataset
Copy the hadoop-microbook.jar file from SAMPLE_DIR to HADOOP_HOME. Run the first MapReduce job to calculate the buying frequency. To do that run the following command from HADOOP_HOME:
$ bin/hadoop jar hadoop-microbook.jar microbook.frequency.BuyingFrequencyAnalyzer /data/amazon-dataset /data/frequency-output1
Use the following command to run the second MapReduce job to sort the results of the first MapReduce job:
$ bin/hadoop jar hadoop-microbook.jar microbook.frequency.SimpleResultSorter /data/frequency-output1 /data/frequency-output2
You can find the results from the output directory. Copy results to HADOOP_HOME using the following command:
$ bin/hadoop dfs -get /data/frequency-output2/part-r-00000 1.data
Copy all the *.plot files from SAMPLE_DIR to HADOOP_HOME. Generate the plot by running the following command from HADOOP_HOME:
$ gnuplot buyfreq.plot
It will generate a file called buyfreq.png, which will look like the following: As the figure depicts, few buyers have brought a very large number of items. The distribution is much steeper than a normal distribution, and often follows what we call a Power Law distribution. This is an example of how analytics and plotting the results can give us insight into underlying patterns in the dataset. How it works...
You can find the mapper and reducer code at src/microbook/frequency/BuyingFrequencyAnalyzer.java. This figure shows the execution of two MapReduce jobs. The following code listing also shows the map function and the reduce function of the first job:
public void map(Object key, Text value, Context context) throws IOException, InterruptedException {
    List<BuyerRecord> records = BuyerRecord.parseAItemLine(value.toString());
    for (BuyerRecord record : records) {
        context.write(new Text(record.customerID), new IntWritable(record.itemsBrought.size()));
    }
}

public void reduce(Text key, Iterable<IntWritable> values, Context context) {
    int sum = 0;
    for (IntWritable val : values) {
        sum += val.get();
    }
    result.set(sum);
    context.write(key, result);
}
As shown by the figure, Hadoop will read the input file from the input folder and read records using the custom formatter we introduced in the Writing a formatter (Intermediate) article. It invokes the mapper once per record, passing the record as input. The mapper extracts the customer ID and the number of items the customer has brought, and emits the customer ID as the key and number of items as the value. Then, Hadoop sorts the key-value pairs by the key and invokes a reducer once for each key, passing all values for that key as inputs to the reducer. Each reducer sums up all item counts for each customer ID and emits the customer ID as the key and the count as the value in the results. Then the second job sorts the results. It reads the output of the first job and passes each line as an argument to the map function. The map function extracts the customer ID and the number of items from the line and emits the number of items as the key and the customer ID as the value. Hadoop will sort the key-value pairs by the key, thus sorting them by the number of items, and invokes the reducer once per key in the same order. Therefore, the reducer prints them out in the same order, essentially sorting the dataset. Since we have generated the results, let us look at the plotting. You can find the source for the gnuplot file from buyfreq.plot. The source for the plot will look like the following:
set terminal png
set output "buyfreq.png"
set title "Frequency Distribution of Items brought by Buyer";
set ylabel "Number of Items Brought";
set xlabel "Buyers Sorted by Items count";
set key left top
set log y
set log x
plot "1.data" using 2 title "Frequency" with linespoints
Here the first two lines define the output format. This example uses png, but gnuplot supports many other terminals such as screen, pdf, and eps. The next four lines define the axis labels and the title, and the next two lines define the scale of each axis, and this plot uses log scale for both. The last line defines the plot. Here, it is asking gnuplot to read the data from the 1.data file, and to use the data in the second column of the file via using 2, and to plot it using lines. Columns must be separated by whitespaces. Here if you want to plot one column against another, for example data from column 1 against column 2, you should write using 1:2 instead of using 2. There's more... We can use a similar method to calculate most types of analytics and plot the results. Refer to the freely available article of Hadoop MapReduce Cookbook, Srinath Perera and Thilina Gunarathne, Packt Publishing at http://www.packtpub.com/article/advanced-hadoop-mapreduce-administration for more information.
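For quick experimentation, the same two-pass computation can also be sketched locally in plain Python on a small sample before running it at scale. The tab-separated "customer, item" input format below is a simplification assumed for the sketch, not the layout of the real Amazon dataset:

from collections import Counter

def buying_frequency(lines):
    counts = Counter()
    for line in lines:
        customer_id, _item = line.rstrip("\n").split("\t")
        counts[customer_id] += 1                  # first job: items per buyer
    return sorted(counts.items(), key=lambda kv: kv[1])   # second job: sort by count

sample = ["A\tbook1", "A\tbook2", "B\tbook1", "A\tbook3", "C\tbook2"]
with open("sample.data", "w") as out:
    for customer, count in buying_frequency(sample):
        out.write("%s %d\n" % (customer, count))  # two whitespace-separated columns

Writing the buyer in column 1 and the count in column 2 gives the file the same two-column shape that buyfreq.plot expects with using 2, so the same plotting approach can be pointed at a small local sample.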
Summary In this article, we have learned how to process Amazon data with MapReduce, generate data for a histogram, and plot it using gnuplot. Resources for Article : Further resources on this subject: Advanced Hadoop MapReduce Administration [Article] Comparative Study of NoSQL Products [Article] HBase Administration, Performance Tuning [Article]


Calculus

Packt
14 Aug 2013
8 min read
(For more resources related to this topic, see here.) Derivatives To compute the derivative of a function, create the corresponding expression and use diff(). Its first argument is the expression and the second is the variable with regard to which you want to differentiate. The result is the expression for the derivative: >>> diff(exp(x**2), x)2*x*exp(x**2)>>> diff(x**2 * y**2, y)2*x**2*y Higher-order derivatives can also be computed with a single call to diff(): >>> diff(x**3, x, x)6*x>>> diff(x**3, x, 2)6*x>>> diff(x**2 * y**2, x, 2, y, 2)4 Due to SymPy's focus on expressions rather than functions, the derivatives for symbolic functions can seem a little surprising, but LaTeX rendering in the notebook should make their meaning clear. >>> f = Function('f')>>> diff(f(x**2), x)2*x*Subs(Derivative(f(_xi_1), _xi_1), (_xi_1,), (x**2,)) Let's take a look at the following screenshot: Limits Limits are obtained through limit(). The syntax for the limit of expr when x goes to some value x0 is limit(expr, x, x0). To specify a limit towards infinity, you need to use SymPy's infinity object, named oo. This object will also be returned for infinite limits: >>> limit(exp(-x), x, oo)0>>> limit(1/x**2, x, 0)oo There is also a fourth optional parameter, to specify the direction of approach of the limit target. "+" (the default) gives the limit from above, and "-" is from below. Obviously, this parameter is ignored when the limit target is infinite: >>> limit(1/x, x, 0, "-")-oo>>> limit(1/x, x, 0, "+")oo Let's take a look at the following screenshot: Integrals SymPy has powerful algorithms for integration, and, in particular, can find most integrals of logarithmic and exponential functions expressible with special functions, and many more besides, thanks to Meijer G-functions. The main function for integration is integrate(). It can compute both antiderivatives (indefinite integrals) and definite integrals. Note that the value of an antiderivative is only defined up to an arbitrary constant but the result does not include it. >>> integrate(sin(x), x)-cos(x)>>> integrate(sin(x), (x, 0, pi))2 Unevaluated symbolic integrals and antiderivatives are represented by the Integral class. integrate() may return these objects if it cannot compute the integral. It is also possible to create Integral objects directly, using the same syntax as integrate(). To evaluate them, call their .doit() method: >>> integral = Integral(sin(x), (x, 0, pi))>>> integralIntegral(sin(x), (x, 0, pi))>>> integral.doit()2 Let's take a look at the following screenshot: Taylor series A Taylor series approximation is an approximation of a function obtained by truncating its Taylor series. To compute it, use series(expr, x, x0, n), where x is the relevant variable, x0 is the point where the expansion is done (defaults to 0), and n is the order of expansion (defaults to 6): >>> series(cos(x), x)1 - x**2/2 + x**4/24 + O(x**6)>>> series(cos(x), x, n=10)1 - x**2/2 + x**4/24 - x**6/720 + x**8/40320 + O(x**10) The O(x**6) part in the result is a "big-O" object. Intuitively, it represents all the terms of order equal to or higher than 6. 
This object automatically absorbs or combines with powers of the variable, which makes simple arithmetic operations on expansions convenient: >>> O(x**2) + 2*x**3O(x**2)>>> O(x**2) * 2*x**3O(x**5)>>> expand(series(sin(x), x, n=6) * series(cos(x), x, n=4))x - 2*x**3/3 + O(x**5)>>> series(sin(x)*cos(x), x, n=5)x - 2*x**3/3 + O(x**5) If you want to use the expansion as an approximation of the function, the O() term prevents it from behaving like an ordinary expression, so you need to remove it. You can do so by using the aptly named .removeO() method: >>> series(cos(x), x).removeO()x**4/24 - x**2/2 + 1 Taylor series look better in the notebook, as shown in the following screenshot: Solving equations This section will teach you how to solve the different types of equations that SymPy handles. The main function to use for solving equations is solve(). Its interface is somewhat complicated as it accepts many different kinds of inputs and can output results in various forms depending on the input. In the simplest case, univariate equations, use the syntax solve(expr, x) to solve the equation expr = 0 for the variable x. If you want to solve an equation of the form A = B, simply put it under the preceding form, using solve(A - B, x). This can solve algebraic and transcendental equations involving rational fractions, square roots, absolute values, exponentials, logarithms, trigonometric functions, and so on. The result is then a list of the values of the variables satisfying the equation. The following commands show a few examples of equations that can be solved: >>> solve(x**2 - 1, x)[-1, 1]>>> solve(x*exp(x) - 1, x)[LambertW(1)]>>> solve(abs(x**2-4) - 3, x)[-1, 1, -sqrt(7), sqrt(7)] Note that the form of the result means that it can only return a finite set of solutions. In cases where the true solution is infinite, it can therefore be misleading. When the solution is an interval, solve() typically returns an empty list. For periodic functions, usually only one solution is returned: >>> solve(0, x) # all x are solutions[]>>> solve(x - abs(x), x) # all positive x are solutions[]>>> solve(sin(x), x) # all k*pi with k integer are solutions[0] The domain over which the equation is solved depends on the assumptions on the variable. Hence, if the variable is a real Symbol object, only real solutions are returned, but if it is complex, then all solutions in the complex plane are returned (subject to the aforementioned restriction on returning infinite solution sets). This difference is readily apparent when solving polynomials, as the following example demonstrates: >>> solve(x**2 + 1, x)[]>>> solve(z**2 + 1, z)[-I, I] There is no restriction on the number of variables appearing in the expression. Solving a multivariate expression for any of its variables allows it to be expressed as a function of the other variables, and to eliminate it from other expressions. The following example shows different ways of solving the same multivariate expression: >>> solve(x**2 - exp(a), x)[-exp(a/2), exp(a/2)]>>> solve(x**2 - exp(a), a)[log(x**2)]>>> solve(x**2 - exp(a), x, a)[{x: -exp(a/2)}, {x: exp(a/2)}]>>> solve(x**2 - exp(a), x, b)[{x: -exp(a/2)}, {x: exp(a/2)}] To solve a system of equations, pass a list of expressions to solve(): each one will be interpreted, as in the univariate case, as an equation of the form expr = 0. 
The result can be returned in one of two forms, depending on the mathematical structure of the input: either as a list of tuples, where each tuple contains the values for the variables in the order given to solve, or a single dictionary, suitable for use in subs(), mapping variables to their values. As you can see in the following example, it can be hard to predict what form the result will take: >>> solve([exp(x**2) - y, y - 3], x, y)[(-sqrt(log(3)), 3), (sqrt(log(3)), 3)]>>> solve([x**2 - y, y - 3], x, y)[(-sqrt(3), 3), (sqrt(3), 3)]>>> solve([x - y, y - 3], x, y){y: 3, x: 3} This variability in return types is fine for interactive use, but for library code, more predictability is required. In this case, you should use the dict=True option. The output will then always be a list of mappings of variables to value. Compare the following example to the previous one: >>> solve([x**2 - y, y - 3], x, y, dict=True)[{y: 3, x: -sqrt(3)}, {y: 3, x: sqrt(3)}]>>> solve([x - y, y - 3], x, y, dict=True)[{y: 3, x: 3}] Summary We successfully computed the various mathematical operations using the SymPy application, Calculus. Resources for Article : Further resources on this subject: Move Further with NumPy Modules [Article] Advanced Indexing and Array Concepts [Article] Running a simple game using Pygame [Article]

Using Unrestricted Languages

Packt
13 Aug 2013
15 min read
(For more resources related to this topic, see here.) Are untrusted languages inferior to trusted ones? No, on the contrary, these languages are untrusted in the same way that a sharp knife is untrusted and should not be trusted to very small children, at least not without adult supervision. They have extra powers that ordinary SQL or even the trusted languages (such as PL/pgSQL) and trusted variants of the same language (PL/Perl versus PL/Perlu) don't have. You can use the untrusted languages to directly read and write on the server's disks, and you can use it to open sockets and make Internet queries to the outside world. You can even send arbitrary signals to any process running on the database host. Generally, you can do anything the native language of the PL can do. However, you probably should not trust arbitrary database users to have the right to define functions in these languages. Always think twice before giving all privileges on some untrusted language to a user or group by using the *u languages for important functions. Can you use the untrusted languages for important functions? Absolutely. Sometimes, it may be the only way to accomplish some tasks from inside the server. Performing simple queries and computations should do nothing harmful to your database, and neither should connecting to the external world for sending e-mails, fetching web pages, or doing SOAP requests. They may cause delays and even queries that get stuck, but these can usually be dealt with by setting an upper limit as to how long a query can run by using an appropriate statement time-out value. Setting a reasonable statement time-out value by default is a good practice anyway. So, if you don't deliberately do risky things, the probability of harming the database is no bigger than using a "trusted" (also known as "restricted") variant of the language. However, if you give the language to someone who starts changing bytes on the production database "to see what happens", you probably get what you asked for. Will untrusted languages corrupt the database? The power to corrupt the database is definitely there, since the functions run as the system user of the database server with full access to the filesystem. So, if you blindly start writing into the data files and deleting important logs, it is very likely that your database will be corrupted. Additional types of denial-of-service attacks are also possible such as using up all memory or opening all IP ports; but there are ways to overload the database using plain SQL as well, so that part is not much different from the trusted database access with the ability to just run arbitrary queries. So yes, you can corrupt the database, but please don't do it on a production server. If you do, you will be sorry. Why untrusted? PostgreSQL's ability to use an untrusted language is a powerful way to perform some nontraditional things from database functions. Creating these functions in a PL is an order of magnitude smaller task than writing an extension function in C. 
For example, a function to look up a hostname for an IP address is only a few lines in PL/Pythonu:

CREATE FUNCTION gethostbyname(hostname text)
  RETURNS inet
AS $$
import socket
return socket.gethostbyname(hostname)
$$ LANGUAGE plpythonu SECURITY DEFINER;

You can test it immediately after creating the function by using psql:

hannu=# select gethostbyname('www.postgresql.org');
 gethostbyname
----------------
 98.129.198.126
(1 row)

Creating the same function in the most untrusted language, C, involves writing tens of lines of boilerplate code, worrying about memory leaks, and all the other problems coming from writing code in a low-level language. I recommend prototyping in some PL language if possible, and in an untrusted language if the function needs something that the restricted languages do not offer.

Why PL/Python?

All of these tasks could be done equally well using PL/Perlu or PL/Tclu; I chose PL/Pythonu mainly because Python is the language I am most comfortable with. This also translates to having written some PL/Python code, which I plan to discuss and share with you in this article.

Quick introduction to PL/Python

PL/pgSQL is a language unique to PostgreSQL and was designed to add blocks of computation and SQL inside the database. While it has grown in its breadth of functionality, it still lacks the completeness of syntax of a full programming language. PL/Python allows your database functions to be written in Python, with all the depth and maturity of writing Python code outside the database.

A minimal PL/Python function

Let's start from the very beginning (again):

CREATE FUNCTION hello(name text)
  RETURNS text
AS $$
return 'hello %s !' % name
$$ LANGUAGE plpythonu;

Here, we see that creating the function starts by defining it as any other PostgreSQL function, with a RETURNS definition of a text field:

CREATE FUNCTION hello(name text) RETURNS text

The difference from what we have seen before is that the language part specifies plpythonu (the language ID for the PL/Pythonu language):

$$ LANGUAGE plpythonu;

Inside the function body it is very much a normal Python function, returning a value obtained by formatting the name passed as an argument into the string 'hello %s !' using the standard Python formatting operator %:

return 'hello %s !' % name

Finally, let's test how this works:

hannu=# select hello('world');
    hello
---------------
 hello world !
(1 row)

And yes, it returns exactly what is expected!

Data type conversions

The first and last things happening when a PL function is called by PostgreSQL are converting argument values between the PostgreSQL and PL types. The PostgreSQL types need to be converted to the PL types on entering the function, and then the return value needs to be converted back into the PostgreSQL type. Except for PL/pgSQL, which uses PostgreSQL's own native types in computations, the PLs are based on existing languages with their own understanding of what types (integer, string, date, and so on) are, how they should behave, and how they are represented internally. They are mostly similar to PostgreSQL's understanding but quite often are not exactly the same. PL/Python converts data from PostgreSQL types to Python types as follows:

  int2, int4: int on Python 2, int on Python 3.
  int8: long on Python 2, int on Python 3.
  real, double, numeric: float on both Python 2 and Python 3. This may lose precision for numeric values.
  bytea: str on Python 2, bytes on Python 3. No encoding conversion is done, nor should any encoding be assumed.
  text, char(), varchar(), and other text types: str on both Python 2 and Python 3. On Python 2, the string will be in the server encoding; on Python 3, it is a Unicode string.
  All other types: str on both Python 2 and Python 3. PostgreSQL's type output function is used to convert to this string.

Inside the function, all computation is done using Python types, and the return value is converted back to PostgreSQL using the following rules (this is a direct quote from the official PL/Python documentation at http://www.postgresql.org/docs/current/static/plpython-data.html):

  When the PostgreSQL return type is Boolean, the return value will be evaluated for truth according to the Python rules. That is, 0 and empty string are false, but notably 'f' is true.
  When the PostgreSQL return type is bytea, the return value will be converted to a string (Python 2) or bytes (Python 3) using the respective Python built-ins, with the result being converted to bytea.
  For all other PostgreSQL return types, the returned Python value is converted to a string using Python's built-in str, and the result is passed to the input function of the PostgreSQL data type. Strings in Python 2 are required to be in the PostgreSQL server encoding when they are passed to PostgreSQL. Strings that are not valid in the current server encoding will raise an error; but not all encoding mismatches can be detected, so garbage data can still result when this is not done correctly. Unicode strings are converted to the correct encoding automatically, so it can be safer and more convenient to use those. In Python 3, all strings are Unicode strings.

In other words, 0, False, and an empty sequence (including the empty string '') or an empty dictionary all become PostgreSQL false; everything else becomes true. One notable exception to this is that the check for None is done before any other conversions and, even for Booleans, None is always converted to NULL and not to the Boolean value false. For the bytea type, the PostgreSQL byte array, the conversion from Python's string representation is an exact copy, with no encoding or other conversions applied.

Writing simple functions in PL/Python

Writing functions in PL/Python is not much different in principle from writing functions in PL/pgSQL. You still have the exact same syntax around the function body in $$, and the argument names, types, and returns all mean the same thing regardless of the exact PL language used.

A simple function

So a simple add_one() function in PL/Python looks like this:

CREATE FUNCTION add_one(i int)
  RETURNS int
AS $$
return i + 1;
$$ LANGUAGE plpythonu;

It can't get much simpler than that, can it? What you see here is that the PL/Python arguments are passed to the Python code after converting them to the appropriate types, and the result is passed back and converted to the appropriate PostgreSQL type for the return value.

Functions returning a record

To return a record from a Python function, you can use:

  A sequence or list of values in the same order as the fields in the return record
  A dictionary with keys matching the fields in the return record
  A class or type instance with attributes matching the fields in the return record

Here are samples of the three ways to return a record.
First, using an instance: CREATE OR REPLACE FUNCTION userinfo( INOUT username name, OUT user_id oid, OUT is_superuser boolean) AS $$ class PGUser: def __init__(self,username,user_id,is_superuser): self.username = username self.user_id = user_id self.is_superuser = is_superuser u = plpy.execute(""" select usename,usesysid,usesuper from pg_user where usename = '%s'""" % username)[0] user = PGUser(u['usename'], u['usesysid'], u['usesuper']) return user $$ LANGUAGE plpythonu; Then, a little simpler one using a dictionary: CREATE OR REPLACE FUNCTION userinfo( INOUT username name, OUT user_id oid, OUT is_superuser boolean) AS $$ u = plpy.execute(""" select usename,usesysid,usesuper from pg_user where usename = '%s'""" % username)[0] return {'username':u['usename'], 'user_id':u['usesysid'], 'is_ superuser':u['usesuper']} $$ LANGUAGE plpythonu; Finally, using a tuple: CREATE OR REPLACE FUNCTION userinfo( INOUT username name, OUT user_id oid, OUT is_superuser boolean) AS $$ u = plpy.execute(""" select usename,usesysid,usesuper from pg_user where usename = '%s'""" % username)[0] return (u['usename'], u['usesysid'], u['usesuper']) $$ LANGUAGE plpythonu; Notice [0] at the end of u = plpy.execute(...)[0] in all the examples. It is there to extract the first row of the result, as even for one-row results plpy.execute still returns a list of results. Danger of SQL injection! As we have neither executed a prepare() method and executed a execute() method with arguments after it, nor have we used the plpy.quote_literal() method (both techniques are discussed later) to safely quote the username before merging it into the query, we are open to a security flaw known as SQL injection. So, make sure that you only let trusted users call this function or supply the username argument. Calling the function defined via any of these three CREATE commands will look exactly the same: hannu=# select * from userinfo('postgres'); username | user_id | is_superuser ----------+---------+-------------- postgres | 10 | t (1 row) It usually does not make sense to declare a class inside a function just to return a record value. This possibility is included mostly for cases where you already have a suitable class with a set of attributes matching the ones the function returns. Table functions When returning a set from a PL/Python functions, you have three options: Return a list or any other sequence of return type Return an iterator or generator yield the return values from a loop Here, we have three ways to generate all even numbers up to the argument value using these different styles. First, returning a list of integers: CREATE FUNCTION even_numbers_from_list(up_to int) RETURNS SETOF int AS $$ return range(0,up_to,2) $$ LANGUAGE plpythonu; The list here is returned by a built-in Python function called range, which returns a result of all even numbers below the argument. This gets returned as a table of integers, one integer per row from the PostgreSQL function. If the RETURNS clause of the function definition would say int[] instead of SETOF int, the same function would return a single number of even integers as a PostgreSQL array. The next function returns a similar result using a generator and returning both the even number and the odd one following it. Also, notice the different PostgreSQL syntax RETURNS TABLE(...) 
used this time for defining the return set: CREATE FUNCTION even_numbers_from_generator(up_to int) RETURNS TABLE (even int, odd int) AS $$ return ((i,i+1) for i in xrange(0,up_to,2)) $$ LANGUAGE plpythonu; The generator is constructed using a generator expression (x for x in <seq>). Finally, the function is defined using a generator using and explicit yield syntax, and yet another PostgreSQL syntax is used for returning SETOF RECORD with the record structure defined this time by OUT parameters: CREATE FUNCTION even_numbers_with_yield(up_to int, OUT even int, OUT odd int) RETURNS SETOF RECORD AS $$ for i in xrange(0,up_to,2): yield i, i+1 $$ LANGUAGE plpythonu; The important part here is that you can use any of the preceding ways to define a PL/Python set returning function, and they all work the same. Also, you are free to return a mixture of different types for each row of the set: CREATE FUNCTION birthdates(OUT name text, OUT birthdate date) RETURNS SETOF RECORD AS $$ return ( {'name': 'bob', 'birthdate': '1980-10-10'}, {'name': 'mary', 'birthdate': '1983-02-17'}, ['jill', '2010-01-15'], ) $$ LANGUAGE plpythonu; This yields result as follows: hannu=# select * from birthdates(); name | birthdate ------+------------ bob | 1980-10-10 mary | 1983-02-17 jill | 2010-01-15 (3 rows) As you see, the data returning a part of PL/Pythonu is much more flexible than returning data from a function written in PL/pgSQL. Running queries in the database If you have ever accessed a database in Python, you know that most database adapters conform to a somewhat loose standard called Python Database API Specification v2.0 or DBAPI 2 for short. The first thing you need to know about database access in PL/Python is that in-database queries do not follow this API. Running simple queries Instead of using the standard API, there are just three functions for doing all database access. There are two variants: plpy.execute() for running a query, and plpy.prepare() for turning query text into a query plan or a prepared query. The simplest way to do a query is with: res = plpy.execute(<query text>, [<row count>]) This takes a textual query and an optional row count, and returns a result object, which emulates a list of dictionaries, one dictionary per row. As an example, if you want to access a field 'name' of the third row of the result, you use: res[2]['name'] The index is 2 and not 3 because Python lists are indexed starting from 0, so the first row is res[0], the second row res[1], and so on. Using prepared queries In an ideal world this would be all that is needed, but plpy.execute(query, cnt) has two shortcomings: It does not support parameters The plan for the query is not saved, requiring the query text to be parsed and run through the optimizer at each invocation We will show a way to properly construct a query string later, but for most uses simple case parameter passing is enough. So, the execute(query, [maxrows]) call becomes a set of two statements: plan = plpy.prepare(<query text>, <list of argument types>) res = plpy.execute(plan, <list of values>, [<row count>])For example, to query if a user 'postgres' is a superuser, use the following: plan = plpy.prepare("select usesuper from pg_user where usename = $1", ["text"]) res = plpy.execute(plan, ["postgres"]) print res[0]["usesuper"] The first statement prepares the query, which parses the query string into a query tree, optimizes this tree to produce the best query plan available, and returns the prepared_query object. 
The second statement uses the prepared plan to query for a specific user's superuser status. The prepared plan can be used multiple times, so you could go on to check whether the user bob is a superuser:

res = plpy.execute(plan, ["bob"])
print res[0]["usesuper"]

Caching prepared queries

Preparing the query can be quite an expensive step, especially for more complex queries where the optimizer has to choose from a rather large set of possible plans; so, it makes sense to re-use the results of this step if possible. The current implementation of PL/Python does not automatically cache query plans (prepared queries), but you can do it easily yourself:

try:
    plan = SD['is_super_qplan']
except:
    SD['is_super_qplan'] = plpy.prepare("....
    plan = SD['is_super_qplan']
<the rest of the function>

The values in SD[] and GD[] only live inside a single database session, so it only makes sense to do the caching if you have long-lived connections.
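To make the pattern concrete, here is a minimal, self-contained sketch that combines SD[] plan caching with parameterized execution, which also sidesteps the SQL injection concern raised earlier. The function name is an illustrative choice and not part of the original example:

CREATE OR REPLACE FUNCTION is_superuser(i_username text)
  RETURNS boolean
AS $$
# Reuse the prepared plan if this session has already built it.
try:
    plan = SD['is_super_qplan']
except KeyError:
    # Prepare once per session; $1 is passed as a bound parameter,
    # so no manual quoting of the username is needed.
    SD['is_super_qplan'] = plpy.prepare(
        "select usesuper from pg_user where usename = $1", ["text"])
    plan = SD['is_super_qplan']

rows = plpy.execute(plan, [i_username], 1)
if rows.nrows() == 0:
    return None  # unknown user maps to NULL
return rows[0]['usesuper']
$$ LANGUAGE plpythonu;

Called as select is_superuser('postgres'), this should return t, and any further calls in the same session skip the prepare step entirely.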


Quick start – Creating your first Java application

Packt
12 Aug 2013
14 min read
(For more resources related to this topic, see here.) Cassandra's storage architecture is designed to manage large data volumes and revolve around some important factors: Decentralized systems Data replication and transparency Data partitioning Decentralized systems are systems that provide maximum throughput from each node.Cassandra offers decentralization by keeping each node with an identical configuration. There are no such master-slave configurations between nodes. Data is spread across nodes and each node is capable of serving read/write requests with the same efficiency. A data center is a physical space where critical application data resides. Logically, a data center is made up of multiple racks, and each rack may contain multiple nodes. Cassandra replication strategies Cassandra replicates data across the nodes based on configured replication. If the replication factor is 1, it means that one copy of the dataset will be available on one node only. If the replication factor is 2, it means two copies of each dataset will be made available on different nodes in the cluster. Still, Cassandra ensures data transparency, as for an end user data is served from one logical cluster. Cassandra offers two types of replication strategies. Simple strategy Simple strategy is best suited for clusters involving a single data center, where data is replicated across different nodes based on the replication factor in a clockwise direction. With a replication factor of 3, two more copies of each row will be copied on nearby nodes in a clockwise direction: Network topology strategy Network topology strategy ( NTS ) is preferred when a cluster is made up of nodes spread across multiple data centers. With NTS, we can configure the number of replicas needed to be placed within each data center. Data colocation and no single point of failure are two important factors that we need to consider priorities while configuring the replication factor and consistency level. NTS identifies the first node based on the selected schema partitioning and then looks up for nodes in a different rack (in the same data center). In case there is no such node, data replicas will be passed on to different nodes within the same rack. In this way, data colocation can be guaranteed by keeping the replica of a dataset in the same data center (to serve read requests locally). This also minimizes the risk of network latency at the same time. NTS depends on snitch configuration for proper data replica placement across different data centers. A snitch relies upon the node IP address for grouping nodes within the network topology. Cassandra depends upon this information for routing data requests internally between nodes. The preferred snitch configurations for NTS are RackInferringSnitch and PropertyFileSnitch . We can configure snitch in cassandra.yaml (the configuration file). Data partitioning Data partitioning strategy is required for node selection of a given data read/request. Cassandra offers two types of partitioning strategies. Random partitioning Random partitioning is the recommended partitioning scheme for Cassandra. Each node is assigned a 128-bit token value ( initial_token for a node is defined in cassandra.yaml) generated by a one way hashing (MD5) algorithm. Each node is assigned an initial token value (to determine the position in a ring) and a data range is assigned to the node. 
If a read/write request with the token value (generated for a row key value) lies within the assigned range of nodes, then that particular node is responsible for serving that request. The following diagram is a common graphical representation of the numbers of nodes placed in a circular representation or a ring, and the data range is evenly distributed between these nodes: Ordered partitioning Ordered partitioning is useful when an application requires key distribution in a sorted manner. Here, the token value is the actual row key value. Ordered partitioning also allows you to perform range scans over row keys. However, with ordered partitioning, key distribution might be uneven and may require load balancing administration. It is certainly possible that the data for multiple column families may get unevenly distributed and the token range may vary from one node to another. Hence, it is strongly recommended not to opt for ordered partitioning unless it is really required. Cassandra write path Here, we will discuss how the Cassandra process writes a request and stores it on a disk: As we have mentioned earlier, all nodes in Cassandra are peers and there is no master-slave configuration. Hence, on receiving a write request, a client can select any node to serve as a coordinator. The coordinator node is responsible for delegating write requests to an eligible node based on the cluster's partitioning strategy and replication factor. First, it is written to a commit log and then it is delegated to corresponding memtables (see the preceding diagram). A memtable is an in-memory table, which serves subsequent read requests without any look up in the disk. For each column family, there is one memtable. Once a memtable is full, data is flushed down in the form of SS tables (on disk), asynchronously. Once all the segments are flushed onto the disk, they are recycled. Periodically, Cassandra performs compaction over SS tables (sorted by row keys) and claims unused segments. In case of data node restart (unwanted scenarios such as failover), the commit log replay will happen, to recover any previous incomplete write requests. Hands on with the Cassandra command-line interface Cassandra provides a default command-line interface that is located at: CASSANDRA_HOME/bin/cassandra-cli.sh using Linux CASSANDRA_HOME/bin/cassandra-cli.bat using Windows Before we proceed with the sample exercise, let's have a look at the Cassandra schema: Keyspace: A keyspace may contain multiple column families; similarly, a cluster (made up of multiple nodes) can contain multiple keyspaces. Column family: A column family is a collection of rows with defined column metadata. Cassandra offers different ways to define two types of column families, namely, static and dynamic column families. Static column family: A static column family contains a predefined set of columns with metadata. Please note that a predefined set of columns may exist, but the number of columns can vary across multiple rows within the column family. Dynamic column family: A dynamic column family generally defines a comparator type and validation class for all columns instead of individual column metadata. The client application is responsible for providing columns for a particular row key, which means the column names and values may differ across multiple row keys: Column: A column can be attributed as a cell, which contains a name, value, and timestamp. 
Super column: A super column is similar to a column and contains a name, value, and timestamp, except that a super column value may contain a collection of columns. Super columns cannot be sorted; however, subcolumns within super columns can be sorted by defining a sub comparator. Super columns do have some limitations, such as that secondary indexes over super columns are not possible. Also, it is not possible to read a particular super column without deserialization of the wrapped subcolumns. Because of such limitations, usage of super columns is highly discouraged within the Cassandra community. Using composite columns we can achieve such functionalities. In the next articles, we will cover composite columns in detail: Counter column family: Since 0.8 onwards, Cassandra has enabled support for counter columns. Counter columns are useful for applications that perform the following: Maintain the page count for the website Do aggregation based on a column value from another column family A counter column is a sort of 64 bit signed integer. To create a counter column family, we simply need to define default_validation_class as CounterColumnType. Counter columns do have some application and technical limitations: In case of events, such as disk failure, it is not possible to replay a column family containing counters without reinitializing and removing all the data Secondary indexes over counter columns are not supported in Cassandra Frequent insert/delete operations over the counter column in a short period of time may result in inconsistent counter values There are still some unresolved issues (https://issues.apache.org/jira/browse/CASSANDRA-4775) and to considering the preceding limitations before opting for counter columns is recommended. You can start a Cassandra server simply by running $CASSANDRA_HOME/bin/ cassandra. If started in the local mode, it means there is only one node. Once successfully started, you should see logs on your console, as follows: Cassandra-cli: Cassandra distribution, by default, provides a command-line utility (cassandra-cli ), which can be used for basic ddl /dml operations; you can connect to a local/remote Cassandra server instance by specifying the host and port options, as follows: $CASSANDRA_HOME/bin/cassandra-cli -host locahost -port 9160 Performing DDL/DML operations on the column family First, we need to create a keyspace using the create keyspace command, as follows: The create keyspace command: This operation will create a keyspace cassandraSample with node placement strategy as SimpleStrategy and replication factor one. By default, if you don't specify placement_strategy and strategy_options, it will opt for NTS, where replication will be on one data center: create keyspace cassandraSample with placement_strategy='org.apache.cassandra.locator.SimpleStrategy' and strategy_options = {replication_factor:1}; We can look for available keyspaces by running the following command: show keyspaces; This will result in the following output: We can always update the keyspace for configurations, such as replication factor. To update the keyspace, do the following: Modify the replication factor: You can update a keyspace for changing the replication factor as well as the placement strategy. 
For example, to change a replication factor to 2 for cassandraSample, you simply need to execute the following command: update keyspace cassandraSample with placement_strategy='org.apache.cassandra.locator.SimpleStrategy' and strategy_options = {replication_factor:2}; Modify the placement strategy: You can change the placement strategy for NTS by executing the following command: update keyspace cassandraSample with placement_strategy='org.apache.cassandra.locator.NetworkTopologyStrategy' and strategy_options = {datacenter1:1}; Strategy options are in the format {datacentername:number of replicas}, and there can be multiple datacenters. After successfully creating a keyspace before proceeding with other ddl operations (for example, column family creation), we need to authorize a keyspace. We will authorize to a keyspace using the following command: use cassandraSample; Create a column family/super column family as follows: Use the following command to create column family users within the cassandraSample keyspace: create column family users with key_validation_class = 'UTF8Type' and comparator = 'UTF8Type' and default_validation_class = 'UTF8Type'; To create a super column family suysers, you need to run the following command: create column family suysers with key_validation_class = 'UTF8Type' and comparator = 'UTF8Type' and subcomparator='UTF8Type' and default_validation_class = 'UTF8Type' and column_type='Super' and column_metadata=[{column_name: name, validation_class: UTF8Type}]; key_validation_class: It defines the datatype for the row key comparator: It defines the datatype for the column name default_validation_class: It defines the datatype for the column value subcomparator: It defines the datatype for subcolumns. You can create/update a column by using the set method as follows: // create a column named "username", with a value of "user1" for row key 1set users[1][username] = user1; // create a column named "password", with a value of "password1" for row key 1 set users[1][password] = password1;// create a column named "username", with a value of "user2" for row key 2set users[2][username] = user2; // create a column named "password", with a value of "password2" for row key 2 set users[2][password] = password2; To fetch all the rows and columns from a column family, execute the following command: // to list down all persisted rows within a column family.list users ; // to fetch a row from users column family having row key value "1".get users[1]; You can delete a column as follows: // to delete a column "username" for row key 1;del users[1][username]; To update the column family, do the following: If you want to change key_validation_class from UTF8Type to BytesType and validation_class for the password column from UTF8Type to BytesType, then type the following command: update column family users with key_validation_class=BytesType and comparator=UTF8Type and column_metadata = [{column_name:password, validation_class:BytesType}] To drop/truncate the column family, follow the ensuing steps: Delete all the data rows from a column family users, as follows: truncate users; Drop a column family by issuing the following command: drop column family users; These are some basic operations that should give you a brief idea about how to create/manage the Cassandra schema. Cassandra Query Language Cassandra is schemaless, but CQL is useful when we need data modeling with the traditional RDBMS flavor. Two variants of CQL (2.0 and 3.0) are provided by Cassandra. 
We will use CQL3.0 for a quick exercise. We will refer to similar exercises, as we follow with the Cassandra-cli interface. The command to connect with cql is as follows: $CASSANDRA_HOME/bin/cqlsh host port cqlversion You can connect to the localhost and 9160 ports by executing the following command: $CASSANDRA_HOME/bin/cqlsh localhost 9160 -3 After successfully connecting to the command-line CQL client, you can create the keyspace as follows: create keyspace cassandrasample with strategy_class='SimpleStrategy' and strategy_options:replication_factor=1;Update keyspacealter keyspace cassandrasample with strategy_class='NetworkTopologyStrategy' and strategy_options:datacenter=1; Before creating any column family and storing data, we need to authorize such ddl/dml operations to a keyspace (for example, cassandraSample). We can authorize to a keyspace as follows: use cassandrasample; We can always run the describe keyspace command to look into containing column families and configuration settings. We can describe a keyspace as follows: describe keyspace cassandrasample; We will create a users column family with user_id as row key and username and password as columns. To create a column family, such as users, use the following command: create columnfamily users(user_id varchar PRIMARY KEY,username varchar, password varchar); To store a row in the users column family for row key value 1, we will run the following CQL query: insert into users(user_id,username,password) values(1,'user1','password1'); To select all the data from the users column family, we need to execute the following CQL query: select * from users; We can delete a row as well as specific columns using the delete operation. The following command-line scripts are to perform the deletion of a complete row and column age from the users column family, respectively: // delete complete row for user_id=1delete from users where user_id=1; // delete age column from users for row key 1.delete age from users where user_id=1; You can update a column family to add columns and to update or drop column metadata. Here are a few examples: // add a new columnalter columnfamily users add age int; // update column metadataalter columnfamily users alter password type blob; Truncating a column family will delete all the data belonging to the corresponding column family, whereas dropping a column family will also remove the column family definition along with the containing data. We can drop/truncate the column family as follows: truncate users;drop columnfamily users; Dropping a keyspace means instantly removing all the column families and data available within that keyspace.We can drop a keyspace using the following command: drop keyspace cassandrasample; By default, the CQL shell converts the column family and keyspace name to lowercase. You can ensure case sensitivity by wrapping these identifiers within " " . Summary This article showed how to create a Java application using Cassandra. Resources for Article: Further resources on this subject: Getting Started with Apache Cassandra [Article] Apache Cassandra: Working in Multiple Datacenter Environments [Article] Apache Cassandra: Libraries and Applications [Article]

Overview of SQL Server Reporting Services 2012 Architecture, Features, and Tools

Packt
08 Aug 2013
15 min read
(For more resources related to this topic, see here.) Structural design of SQL servers and SharePoint environment Depending on the business and the resources available, the various servers may be located in distributed locations and the Web applications may also be run from Web servers in a farm and the same can be true for SharePoint servers. In this article, by the word architecture we mean the way by which the preceding elements are put together to work on a single computer. However, it is important to know that this is just one topology (an arrangement of constituent elements) and in general it can be lot more complicated spanning networks and reaching across boundaries. The Report Server is the centerpiece of the Reporting Services installation. This installation can be deployed in two modes, namely, Native mode or SharePoint Integrated mode. Each mode has a separate engine and an extensible architecture. It consists of a collection of special-purpose extensions that handle authentication, data processing, rendering, and delivery operations. Once deployed in one mode it cannot be changed to the other. It is possible to have two servers each installed in a different mode. We have installed all the necessary elements to explore the RS 2012 features, including Power View and Data Alerts. The next diagram briefly shows the structural design of the environment used in working with the article: Primarily, SQL Server 2012 Enterprise Edition is used, for both Native mode as well as SharePoint Integrated mode. As we see in the previous diagram, Report Server Native mode is on a named instance HI (in some places another named instance Kailua is also used). This server has the Reporting Services databases ReportServer$HI and ReportServer$HITempDB. The associated Report Server handles Jobs, Security, and Shared Schedules. The Native mode architecture described in the next section is taken from the Microsoft documentation. The tools (SSDT, Report Builder, Report Server Configuration, and so on) connect to the Report Server. The associated SQL Server Agent takes care of the jobs such as subscriptions related to Native mode. The SharePoint Server 2010 is a required element with which the Reporting Services add-in helps to create a Reporting Services Service. With the creation of the RS Service in SharePoint, three SQL Server 2012 databases (shown alongside in the diagram) are created in an instance with its Reporting Services installed in SharePoint Integrated mode. The SQL Server 2012 instance NJ is installed in this fashion. These databases are repositories for report content including those related to Power Views and Data Alerts. The data sources(extension .rsds) used in creating Power View reports (extension.rdlx) are stored in the ReportingService_b67933dba1f14282bdf434479cbc8f8f database and the alerting related information is stored in the ReportingService_b67933dba1f14282bdf434479cbc8f8f_Alerting database. Not shown is an Express database that is used by the SharePoint Server for its content, administration, and so on. RS_ADD-IN allows you to create the service. You will use the Power Shell tool to create and manage the service. In order to create Power View reports, the new feature in SSRS 2012, you start off creating a data source in SharePoint library. Because of the RS Service, you can enable Reporting Services features such as Report Builder; and associate BISM file extensions to support connecting to tabular models created in SSDT deployed to Analysis Services Server. 
When Reporting Services is installed in SharePoint Integrated mode, SharePoint Web parts will be available to users that allow them to connect to RS Native mode servers to work with reports on the servers from within SharePoint Site. Native mode The following schematic taken from Microsoft documentation (http://msdn.microsoft.com/en-us/library/ms157231.aspx) shows the major components of a Native mode installation: The image shows clearly the several processors that are called into play before a report is displayed. The following are the elements of this processing: Processing extensions(data, rendering, report processing, and authentication) Designing tools(Report Builder, Report Designer) Display devices(browsers) Windows components that do the scheduling and delivery through extensions(Report Server databases, a SQL Server 2012 database, which store everything connected with reports) For the Reporting Services 2012 enabled in Native mode for this article, the following image shows the ReportServer databases and the Reporting Services Server. A similar server HI was also installed after a malware attack. The Report Server is implemented as a Microsoft Windows service called Report Server Service. SharePoint Integrated mode In SharePoint mode, a Report Server must run within a SharePoint Server (even in a standalone implementation). The Report Server processing, rendering, and management are all from SharePoint application server running the Reporting Services SharePoint shared service. For this to happen, at SQL Server installation time, the SharePoint Integrated mode has to be chosen. The access to reports and related operations in this case are from a SharePoint frontend. The following elements are required for SharePoint mode: SharePoint Foundation 2010 or SharePoint Server 2010 An appropriate version of the Reporting Services add-in for SharePoint products A SharePoint application server with a Reporting Services shared service instance and at least one Reporting Services service application The following diagram taken from Microsoft documentation illustrates the various parts of a SharePoint Integrated environment of Reporting Services. Note that the alerting Web service and Power View need SharePoint Integration. The numbered items and their description shown next are also from the same Microsoft document. Follow the link at the beginning of this section. The architectural details presented previously were taken from Microsoft documentation. Item number in the diagram   Description   1   Web servers or Web Frontends (WFE). The Reporting Services add-in must be installed on each Web server from which you want to utilize the Web application feature such as viewing reports or a Reporting Services management page for tasks such as managing data sources and subscriptions.   2   The add-in installs URL and SOAP endpoints for clients to communicate with application servers through the Reporting Services Proxy.   3   Application servers running a shared service. Scale-out of report processing is managed as part of the SharePoint farm and by adding the service to additional application servers.   4 You can create more than one Reporting Services service application with different configurations, including permissions, e-mail, proxy, and subscriptions.   5   Reports, data sources, and other items are stored in SharePoint content databases.   6   Reporting Services service applications create three databases for the Report Server, temp, and data alerting features. 
Configuration settings that apply to all SSRS service applications are stored in RSReportserver.config file.   When you install Reporting Services in SharePoint Integrated mode, several features that you are used to in Native mode will not be available. Some of them are summarized here from the MSDN site: URL access will work but you will have to access SharePoint URL and not Native mode URL. The Native mode folder hierarchy will not work. Custom Security extensions can be used but you need to use the special purpose security extension meant to be used for SharePoint Integration. You cannot use the Reporting Services Configuration Manager (of the Native mode installation).You should use the SharePoint Central Administration shown in this section (for Reporting Services 2008 and 2008 R2). Report Manager is not the frontend; in this case, you should use SharePoint Application pages. You cannot use Linked Reports, My Reports, and My Subscriptions in SharePoint mode. In SharePoint Integrated mode, you can work with Data Alerts and this is not possible in a Native mode installation. Power View is another thing you can do with SharePoint that is not available for Native mode. To access Power View the browser needs Silverlight installed. While reports with RDL extension are supported in both modes, reports with RDLX are only supported in SharePoint mode. SharePoint user token credentials, AAM Zones for internet facing deployments, SharePoint back and recovery, and ULS log support are only available for SharePoint mode. For the purposes of discussion and exercises in this article, a standalone server deployment is used as shown in the next diagram. It must be remembered that there are various other topologies of deployment possible using more than one computer. For a detailed description please follow the link http://msdn.microsoft.com/en-us/library/bb510781(v=sql.105).aspx. The standalone deployment is the simplest, in that all the components are installed on a single computer representative of the installation used for this article. The following diagram taken from the preceding link illustrates the elements of the standalone deployment: Reporting Services configuration For both modes of installation, information for Reporting Services components is stored in configuration files and the registry. During setup the configuration files are copied to the following locations: Native modeC:Program FilesMicrosoft SQL ServerMSRS11.MSSQLSERVER SharePoint Integrated modeC:Program FilesCommon FilesMicrosoft SharedWeb Server Extensions15WebServicesReporting Follow the link http://msdn.microsoft.com/en-us/library/ms155866.aspx for details. Native mode The Report Server Windows Service is an orchestrated set of applications that run in a single process using a single account with access to a single Report Server database with a set of configuration files listed here: Stored in   Description   Location   RSReportServer.config   Stores configuration settings for feature areas of the Report Server Service: Report Manager, the Report Server Web Service, and background processing.   <Installation directory> Reporting Services ReportServer   RSSrvPolicy.config   Stores the code access security policies for the server extensions.   <Installation directory> Reporting Services ReportServer   RSMgrPolicy.config   Stores the code access security policies for Report Manager.   
<Installation directory> Reporting Services ReportManager   Web.config for the Report Server Web Service   Includes only those settings that are required for ASP.NET.   <Installation directory> Reporting Services ReportServer   Web.config for Report Manager   Includes only those settings that are required for ASP.NET.   <Installation directory> Reporting Services ReportManager   ReportingServicesService. exe.config   Stores configuration settings that specify the trace levels and logging options for the Report Server Service.   <Installation directory> Reporting Services ReportServer Bin Registry settings   Stores configuration state and other settings used to uninstall Reporting Services. If you are troubleshooting an installation or configuration problem, you can view these settings to get information about how the Report Server is configured.   Do not modify these settings directly as this can invalidate your installation.   HKEY_LOCAL_MACHINE SOFTWARE Microsoft Microsoft SQL Server <InstanceID> Setup and HKEY_ LOCAL_MACHINE SOFTWARE Microsoft Microsoft SQL ServerServices ReportServer   RSReportDesigner.config   Stores configuration settings for Report Designer. For more information follow the link http://msdn.microsoft.com/en-us/library/ms160346.aspx   <drive>:Program Files Microsoft Visual Studio 10 Common7 IDE PrivateAssemblies   RSPreviewPolicy.config   Stores the code access security policies for the server extensions used during report preview.   C:Program Files Microsoft Visual Studio 10.0 Common7IDE PrivateAssembliesr   First is the RSReportServer configuration file which can be found in the installation directory under Reporting Services. The entries in this file control the feature areas of the three components in the previous image, namely, Report Server Web Service, Report Server Service, Report Manager, and background processing. The ReportServer Configuration file has several sections with which you can modify the following features: General configuration settings URL reservations Authentication Service UI Extensions MapTileServerConfiguration (Microsoft Bing Maps SOAP Services that provides a tile background for map report items in the report) Default configuration file for a Native mode Report Server Default configuration file for a SharePoint mode Report Server The three areas previously mentioned (Report Server Web Service, Report Server Service, and Report Manager) all run in separate application domains and you can turn on/off elements that you may or may not need so as to improve security by reducing the surface area for attacks. Some functionality works for all the three components such as memory management and process health. For example, in the reporting server Kailua in this article, the service name is ReportServer$KAILUA. This service has no other dependencies. In fact, you can access the help file for this service when you look at Windows Services in the Control Panels shown. In three of the tabbed pages of this window you can access contextual help. SharePoint Integrated mode The following table taken from Microsoft documentation describes the configuration files used in the SharePoint mode Report Server. Configuration settings are stored in SharePoint Service application databases. Stored in   Description   Location   RSReportServer. config   Stores configuration settings for feature areas of the Report Server Service: Report Manager, the Report Server Web Service, and background processing.   
<Installation directory> Reporting Services ReportServer   RSSrvPolicy.config   Stores the code access security policies for the server extensions.   <Installation directory> Reporting Services ReportServer   Web.config for the Report Server Web Service Registry settings   Stores configuration state and other settings used to uninstall Reporting Services. Also stores information about each Reporting Services service application.   Do not modify these settings directly as this can invalidate your installation.   HKEY_LOCAL_MACHINE SOFTWARE Microsoft Microsoft SQL Server <InstanceID> Setup   For example instance ID: MSSQL11.MSSQLSERVER and HKEY_LOCAL_MACHINE SOFTWAREMicrosoft Microsoft SQL Server Reporting Services Service Applications   RSReportDesigner. config   Stores configuration settings for Report Designer.   <drive>:Program Files Microsoft Visual Studio 10 Common7 IDE PrivateAssemblies   Hands-on exercise 3.1 – modifying the configuration file in Native mode We can make changes to the rsreportserver.config file if changes are required or some tuning has to be done. For example, you may need to change, to accommodate a different e-mail, change authentication, and so on. This is an XML file that can be edited in Notepad.exe (you can also use an XML Editor or Visual Studio). You need to start Notepad with administrator privileges. Turn on/off the Report Server Web Service In this exercise, we will modify the configuration file to turn on/off the Report Server Web Service. Perform the following steps: Start Notepad using Run as Administrator. Open the file at this location (you may use Start Search| for rsreportserver.config) which is located at C:Program FilesMicrosoft SQL ServerMSRS11.KAILUAReporting ServicesReportServerrsreportserver.config. In Edit Find| type in IsWebServiceEnabled. There are two values True/False. If you want to turn off, change TRUE to FALSE. The default is TRUE.Here is a section of the file reproduced: <Service> <IsSchedulingService>True</IsSchedulingService> <IsNotificationService>True</IsNotificationService> <IsEventService>True</IsEventService> <PollingInterval>10</PollingInterval> <WindowsServiceUseFileShareStorage>False </WindowsServiceUseFileShareStorage> <MemorySafetyMargin>80</MemorySafetyMargin> <MemoryThreshold>90</MemoryThreshold> <RecycleTime>720</RecycleTime> <MaxAppDomainUnloadTime>30</MaxAppDomainUnloadTime> <MaxQueueThreads>0</MaxQueueThreads> <UrlRoot> </UrlRoot> <UnattendedExecutionAccount> <UserName></UserName> <Password></Password> <Domain></Domain> </UnattendedExecutionAccount> <PolicyLevel>rssrvpolicy.config</PolicyLevel> <IsWebServiceEnabled>True</IsWebServiceEnabled> <IsReportManagerEnabled>True</IsReportManagerEnabled> <FileShareStorageLocation> <Path> </Path> </FileShareStorageLocation> </Service> Save the file to apply changes. Turn on/off the scheduled events and delivery This changes the report processing and delivery. Make changes in the rsreportserver.config file in the following section of <Service/>: <IsSchedulingService>True</IsSchedulingService> <IsNotificationService>True</IsNotificationService> <IsEventService>True</IsEventService> The default value for all of the three is TRUE. You can make it FALSE and save the file to apply changes. This can be carried out modifying FACET in SQL Server Management Studio (SSMS), but presently this is not available. Turn on/off the Report Manager Report Manager can be turned off or on by making changes to the configuration file. 
Make a change to the following section in the <Service/>: <IsReportManagerEnabled>True</IsReportManagerEnabled> Again, this change can be made using the Reporting Services Server in its FACET. To change this make sure you launch SQL Server Management Studio as Administrator. In the following sections use of SSMS via Facets is described. Hands-on exercise 3.2 – turn the Reporting Service on/off in SSMS The following are the steps to turn the Reporting Service on/off in SSMS: Connect to Reporting Services_KAILUA in SQL Server Management Studio as the Administrator. Choose HODENTEKWIN7KAILUA under Reporting Services. Click on OK. Right-click on HODENTEKWIN7KAILUA (Report Server 11.0.22180 –HodentekWin7mysorian). Click on Facets to open the following properties page Click on the handle and set it to True or False and click on OK. The default is True. It should be possible to turn Windows Integrated security on or off by using SQL Server Management Studio. However, the Reporting Services Server properties are disabled.

Understanding MapReduce

Packt
07 Aug 2013
18 min read
(For more resources related to this topic, see here.) Key/value pairs Here we will explain why some operations process and provide the output in terms of key/value pair. What it mean Firstly, we will clarify just what we mean by key/value pairs by highlighting similar concepts in the Java standard library. The java.util.Map interface is the parent of commonly used classes such as HashMap and (through some library backward reengineering) even the original Hashtable. For any Java Map object, its contents are a set of mappings from a given key of a specified type to a related value of a potentially different type. A HashMap object could, for example, contain mappings from a person's name (String) to his or her birthday (Date). In the context of Hadoop, we are referring to data that also comprises keys that relate to associated values. This data is stored in such a way that the various values in the data set can be sorted and rearranged across a set of keys. If we are using key/value data, it will make sense to ask questions such as the following: Does a given key have a mapping in the data set? What are the values associated with a given key? What is the complete set of keys? We will go into Wordcount in detail shortly, but the output of the program is clearly a set of key/value relationships; for each word (the key), there is a count (the value) of its number of occurrences. Think about this simple example and some important features of key/value data will become apparent, as follows: Keys must be unique but values need not be Each value must be associated with a key, but a key could have no values (though not in this particular example) Careful definition of the key is important; deciding on whether or not the counts are applied with case sensitivity will give different results Note that we need to define carefully what we mean by keys being unique here. This does not mean the key occurs only once; in our data set we may see a key occur numerous times and, as we shall see, the MapReduce model has a stage where all values associated with each key are collected together. The uniqueness of keys guarantees that if we collect together every value seen for any given key, the result will be an association from a single instance of the key to every value mapped in such a way, and none will be omitted. Why key/value data? Using key/value data as the foundation of MapReduce operations allows for a powerful programming model that is surprisingly widely applicable, as can be seen by the adoption of Hadoop and MapReduce across a wide variety of industries and problem scenarios. Much data is either intrinsically key/value in nature or can be represented in such a way. It is a simple model with broad applicability and semantics straightforward enough that programs defined in terms of it can be applied by a framework like Hadoop. Of course, the data model itself is not the only thing that makes Hadoop useful; its real power lies in how it uses the techniques of parallel execution, and divide and conquer. We can have a large number of hosts on which we can store and execute data and even use a framework that manages the division of the larger task into smaller chunks, and the combination of partial results into the overall answer. But we need this framework to provide us with a way of expressing our problems that doesn't require us to be an expert in the execution mechanics; we want to express the transformations required on our data and then let the framework do the rest. 
MapReduce, with its key/value interface, provides such a level of abstraction, whereby the programmer only has to specify these transformations and Hadoop handles the complex process of applying this to arbitrarily large data sets. Some real-world examples To become less abstract, let's think of some real-world data that is key/value pair: An address book relates a name (key) to contact information (value) A bank account uses an account number (key) to associate with the account details (value) The index of a book relates a word (key) to the pages on which it occurs (value) On a computer filesystem, filenames (keys) allow access to any sort of data, such as text, images, and sound (values) These examples are intentionally broad in scope, to help and encourage you to think that key/value data is not some very constrained model used only in high-end data mining but a very common model that is all around us. We would not be having this discussion if this was not important to Hadoop. The bottom line is that if the data can be expressed as key/value pairs, it can be processed by MapReduce. MapReduce as a series of key/value transformations You may have come across MapReduce described in terms of key/value transformations, in particular the intimidating one looking like this: {K1,V1} -> {K2, List<V2>} -> {K3,V3} We are now in a position to understand what this means: The input to the map method of a MapReduce job is a series of key/value pairs that we'll call K1 and V1. The output of the map method (and hence input to the reduce method) is a series of keys and an associated list of values that are called K2 and V2. Note that each mapper simply outputs a series of individual key/value outputs; these are combined into a key and list of values in the shuffle method. The final output of the MapReduce job is another series of key/value pairs, called K3 and V3 These sets of key/value pairs don't have to be different; it would be quite possible to input, say, names and contact details and output the same, with perhaps some intermediary format used in collating the information. Keep this three-stage model in mind as we explore the Java API for MapReduce next. We will first walk through the main parts of the API you will need and then do a systematic examination of the execution of a MapReduce job. The Hadoop Java API for MapReduce Hadoop underwent a major API change in its 0.20 release, which is the primary interface in the 1.0 version. Though the prior API was certainly functional, the community felt it was unwieldy and unnecessarily complex in some regards. The new API, sometimes generally referred to as context objects, for reasons we'll see later, is the future of Java's MapReduce development. Note that caveat: there are parts of the pre-0.20 MapReduce libraries that have not been ported to the new API, so we will use the old interfaces when we need to examine any of these. The 0.20 MapReduce Java API The 0.20 and above versions of MapReduce API have most of the key classes and interfaces either in the org.apache.hadoop.mapreduce package or its subpackages. In most cases, the implementation of a MapReduce job will provide job-specific subclasses of the Mapper and Reducer base classes found in this package. We'll stick to the commonly used K1/K2/K3/ and so on terminology, though more recently the Hadoop API has, in places, used terms such as KEYIN/VALUEIN and KEYOUT/VALUEOUT instead. For now, we will stick with K1/K2/K3 as it helps understand the end-to-end data flow. 
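Before looking at the individual classes, it may help to trace these stages through a hypothetical WordCount run; the byte offsets and input lines below are purely illustrative:

// K1/V1: map input, one record per line (byte offset, line text)
(0,  "the cat sat")
(12, "the dog sat")

// K2/V2: individual map outputs before the shuffle
("the", 1) ("cat", 1) ("sat", 1) ("the", 1) ("dog", 1) ("sat", 1)

// K2/List<V2>: reduce input after the shuffle groups values by key
("cat", [1]) ("dog", [1]) ("sat", [1, 1]) ("the", [1, 1])

// K3/V3: final reduce output
("cat", 1) ("dog", 1) ("sat", 2) ("the", 2)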
The Mapper class

This is a cut-down view of the base Mapper class provided by Hadoop. For our own mapper implementations, we will subclass this base class and override the specified method as follows:

class Mapper<K1, V1, K2, V2>
{
  void map(K1 key, V1 value, Mapper.Context context)
      throws IOException, InterruptedException
  {..}
}

Although the use of Java generics can make this look a little opaque at first, there is actually not that much going on. The class is defined in terms of the key/value input and output types, and then the map method takes an input key/value pair in its parameters. The other parameter is an instance of the Context class that provides various mechanisms to communicate with the Hadoop framework, one of which is to output the results of a map or reduce method.

Notice that the map method only refers to a single instance of K1 and V1 key/value pairs. This is a critical aspect of the MapReduce paradigm in which you write classes that process single records, and the framework is responsible for all the work required to turn an enormous data set into a stream of key/value pairs. You will never have to write map or reduce classes that try to deal with the full data set. Hadoop also provides mechanisms through its InputFormat and OutputFormat classes that provide implementations of common file formats and likewise remove the need to write file parsers for any but custom file types.

There are three additional methods that may sometimes need to be overridden.

protected void setup(Mapper.Context context) throws IOException, InterruptedException

This method is called once before any key/value pairs are presented to the map method. The default implementation does nothing.

protected void cleanup(Mapper.Context context) throws IOException, InterruptedException

This method is called once after all key/value pairs have been presented to the map method. The default implementation does nothing.

protected void run(Mapper.Context context) throws IOException, InterruptedException

This method controls the overall flow of task processing within a JVM. The default implementation calls the setup method once before repeatedly calling the map method for each key/value pair in the split, and then finally calls the cleanup method.

The Reducer class

The Reducer base class works very similarly to the Mapper class, and usually requires subclasses to override only a single reduce method. Here is the cut-down class definition:

public class Reducer<K2, V2, K3, V3>
{
  void reduce(K2 key, Iterable<V2> values, Reducer.Context context)
      throws IOException, InterruptedException
  {..}
}

Again, notice the class definition in terms of the broader data flow (the reduce method accepts K2/V2 as input and provides K3/V3 as output), while the actual reduce method takes only a single key and its associated list of values. The Context object is again the mechanism to output the result of the method.

This class also has the setup, run, and cleanup methods with similar default implementations as with the Mapper class that can optionally be overridden:

protected void setup(Reducer.Context context) throws IOException, InterruptedException

This method is called once before any keys/lists of values are presented to the reduce method. The default implementation does nothing.

protected void cleanup(Reducer.Context context) throws IOException, InterruptedException

This method is called once after all keys/lists of values have been presented to the reduce method. The default implementation does nothing.
protected void run(Reducer.Context context) throws IOException, InterruptedException

This method controls the overall flow of processing the task within the JVM. The default implementation calls the setup method before repeatedly calling the reduce method for as many keys/lists of values as are provided to the Reducer class, and then finally calls the cleanup method.

The Driver class

Although our mapper and reducer implementations are all we need to perform the MapReduce job, there is one more piece of code required: the driver that communicates with the Hadoop framework and specifies the configuration elements needed to run a MapReduce job. This involves aspects such as telling Hadoop which Mapper and Reducer classes to use, where to find the input data and in what format, and where to place the output data and how to format it. There is no default parent Driver class to subclass; the driver logic usually exists in the main method of the class written to encapsulate a MapReduce job. Take a look at the following code snippet as an example driver. Don't worry about how each line works, though you should be able to work out generally what each is doing:

public class ExampleDriver
{
  ...
  public static void main(String[] args) throws Exception
  {
    // Create a Configuration object that is used to set other options
    Configuration conf = new Configuration();
    // Create the object representing the job
    Job job = new Job(conf, "ExampleJob");
    // Set the name of the main class in the job jarfile
    job.setJarByClass(ExampleDriver.class);
    // Set the mapper class
    job.setMapperClass(ExampleMapper.class);
    // Set the reducer class
    job.setReducerClass(ExampleReducer.class);
    // Set the types for the final output key and value
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    // Set input and output file paths
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    // Execute the job and wait for it to complete
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

Given our previous talk of jobs, it is not surprising that much of the setup involves operations on a Job object. This includes setting the job name and specifying which classes are to be used for the mapper and reducer implementations. Certain input/output configurations are set and, finally, the arguments passed to the main method are used to specify the input and output locations for the job. This is a very common model that you will see often.

There are a number of default values for configuration options, and we are implicitly using some of them in the preceding class. Most notably, we don't say anything about the file format of the input files or how the output files are to be written. These are defined through the InputFormat and OutputFormat classes mentioned earlier; we will explore them in detail later. The default input and output formats are text files that suit our WordCount example. There are multiple ways of expressing the format within text files, in addition to particularly optimized binary formats.

A common model for less complex MapReduce jobs is to have the Mapper and Reducer classes as inner classes within the driver. This allows everything to be kept in a single file, which simplifies the code distribution.

Writing MapReduce programs

We have been using and talking about WordCount for quite some time now; let's actually write an implementation, compile, and run it, and then explore some modifications.
Time for action – setting up the classpath

To compile any Hadoop-related code, we will need to refer to the standard Hadoop-bundled classes. Add the Hadoop core JAR file from the distribution (named hadoop-core-1.0.4.jar in the 1.0.4 release) to the Java classpath as follows:

$ export CLASSPATH=.:${HADOOP_HOME}/hadoop-core-1.0.4.jar:${CLASSPATH}

What just happened?

This adds the Hadoop core JAR file explicitly to the classpath alongside the current directory and the previous contents of the CLASSPATH environment variable. Once again, it would be good to put this in your shell startup file or a standalone file to be sourced.

We will later also need to have many of the supplied third-party libraries that come with Hadoop on our classpath, and there is a shortcut to do this. For now, the explicit addition of the core JAR file will suffice.

Time for action – implementing WordCount

We will explore our own Java implementation by performing the following steps:

Enter the following code into the WordCount1.java file:

import java.io.*;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount1
{
    public static class WordCountMapper
        extends Mapper<Object, Text, Text, IntWritable>
    {
        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();

        public void map(Object key, Text value, Context context)
            throws IOException, InterruptedException
        {
            String[] words = value.toString().split(" ");
            for (String str : words)
            {
                word.set(str);
                context.write(word, one);
            }
        }
    }

    public static class WordCountReducer
        extends Reducer<Text, IntWritable, Text, IntWritable>
    {
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException
        {
            int total = 0;
            for (IntWritable val : values)
            {
                total++;
            }
            context.write(key, new IntWritable(total));
        }
    }

    public static void main(String[] args) throws Exception
    {
        Configuration conf = new Configuration();
        Job job = new Job(conf, "word count");
        job.setJarByClass(WordCount1.class);
        job.setMapperClass(WordCountMapper.class);
        job.setReducerClass(WordCountReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

Now compile it by executing the following command:

$ javac WordCount1.java

What just happened?

This is our first complete MapReduce job. Look at the structure and you should recognize the elements we have previously discussed: the overall job class with the driver configuration in its main method and the Mapper and Reducer implementations defined as inner classes.

We'll do a more detailed walkthrough of the mechanics of MapReduce in the next section, but for now let's look at the preceding code and think of how it realizes the key/value transformations we talked about earlier.

The input to the Mapper class is arguably the hardest to understand, as the key is not actually used. The job uses TextInputFormat (the default) as the format of the input data, and this delivers to the mapper data where the key is the byte offset of the line within the file and the value is the text of that line.
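Since the input format is what determines these key/value types, it can help to see the defaults stated explicitly. The following driver variant is a hedged sketch only: the class name is an illustrative assumption, it reuses the WordCount1 inner classes above, and it behaves identically to the main method we just wrote, simply naming the formats that would otherwise be picked up as defaults:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class ExplicitFormatsDriver
{
    public static void main(String[] args) throws Exception
    {
        Configuration conf = new Configuration();
        Job job = new Job(conf, "word count with explicit formats");
        job.setJarByClass(ExplicitFormatsDriver.class);
        job.setMapperClass(WordCount1.WordCountMapper.class);
        job.setReducerClass(WordCount1.WordCountReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        // These two calls restate the defaults; swap in other formats here if needed
        job.setInputFormatClass(TextInputFormat.class);
        job.setOutputFormatClass(TextOutputFormat.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}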
In reality, you may never actually see a mapper that uses that offset key, but it is provided. The mapper is executed once for each line of text in the input source, and every time it takes the line and breaks it into words. It then uses the Context object to output (more commonly known as emitting) each new key/value pair of the form <word, 1>. These are our K2/V2 values.

We said before that the input to the reducer is a key and a corresponding list of values, and there is some magic that happens between the map and reduce methods to collect together the values for each key that facilitates this, which we'll not describe right now. Hadoop executes the reducer once for each key, and the preceding reducer implementation simply counts the numbers in the Iterable object and gives output for each word in the form of <word, count>. These are our K3/V3 values.

Take a look at the signatures of our mapper and reducer classes: the WordCountMapper class takes Object and Text as input and produces Text and IntWritable as output, while the WordCountReducer class takes Text and IntWritable as both input and output. This is again quite a common pattern, where the map method discards the incoming key and instead emits a series of data pairs on which the reducer performs aggregation.

The driver is more meaningful here, as we have real values for the parameters. We use arguments passed to the class to specify the input and output locations.

Time for action – building a JAR file

Before we run our job in Hadoop, we must collect the required class files into a single JAR file that we will submit to the system.

Create a JAR file from the generated class files:

$ jar cvf wc1.jar WordCount1*.class

What just happened?

We must always package our class files into a JAR file before submitting to Hadoop, be it local or on Elastic MapReduce.

Be careful with the jar command and file paths. If you include in a JAR file the class files from a subdirectory, the classes may not be stored with the path you expect. This is especially common when using a catch-all classes directory into which all compiled output is placed. It may be useful to write a script to change into the directory, convert the required files into JAR files, and move the JAR files to the required location.

Time for action – running WordCount on a local Hadoop cluster

Now that we have generated the class files and collected them into a JAR file, we can run the application by performing the following steps:

Submit the new JAR file to Hadoop for execution:

$ hadoop jar wc1.jar WordCount1 test.txt output

Check the output file; it should read as follows:

$ hadoop fs -cat output/part-r-00000
This 1
a 1
is 2
test 1
this 1
yes 1

What just happened?

This is the first time we have used the hadoop jar command with our own code. There are four arguments:

1. The name of the JAR file.
2. The name of the driver class within the JAR file.
3. The location, on HDFS, of the input file (a relative reference to the /user/hadoop home folder, in this case).
4. The desired location of the output folder (again, a relative path).

The name of the driver class is only required if a main class has not (as in this case) been specified within the JAR file manifest.
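As an illustrative aside, if you do record the main class in the manifest when building the JAR file (the jar tool's e flag sets the entry point), the driver class name can then be dropped from the hadoop jar command line; the following commands are a sketch of that variant:

$ jar cvfe wc1.jar WordCount1 WordCount1*.class
$ hadoop jar wc1.jar test.txt output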
So, what is MongoDB?

Packt
02 Aug 2013
6 min read
(For more resources related to this topic, see here.)

What is a document?

While it may vary for the various implementations of the different Document Oriented Databases available, as far as MongoDB is concerned it is a BSON document, which stands for Binary JSON. JSON (JavaScript Object Notation) is an open standard developed for human-readable data exchange. Though a thorough knowledge of JSON is not really important to understand MongoDB, for keen readers the URL to its RFC is http://tools.ietf.org/html/rfc4627. Also, the BSON specification can be found at http://bsonspec.org/. Since MongoDB stores the data as BSON documents, it is a Document Oriented Database.

What does a document look like?

Consider the following example where we represent a person using JSON:

{
  "firstName": "Jack",
  "secondName": "Jones",
  "age": 30,
  "phoneNumbers": [{"fixedLine": "1234"}, {"mobile": "5678"}],
  "residentialAddress": {"lineOne": "…", "lineTwo": "…", "city": "…", "state": "…", "zip": "…", "country": "…"}
}

As we can see, a JSON document always starts and ends with curly braces and has all the content within these braces. Multiple fields and values are separated by commas, with a field name always being a string value and the value being of any type ranging from string, numbers, date, array, another JSON document, and so on. For example, in "firstName":"Jack", firstName is the name of the field whereas Jack is the value of the field.

Need for MongoDB

Many of you are probably wondering why we need another database when we already have good old relational databases. We will look at a few of the drivers behind its introduction back in 2009. Relational databases are extremely rich in features. But these features don't come for free; there is a price to pay, and it is paid by compromising on scalability and flexibility. Let us see these one by one.

Scalability

It is a factor used to measure the ease with which a system can accommodate a growing amount of work or data. There are two ways in which you can scale your system: scale up, also known as scaling vertically, or scale out, also known as scaling horizontally.

Vertical scalability can simply be put as an approach where we say "Need more processing capabilities? Upgrade to a bigger machine with more cores and memory". Unfortunately, with this approach we hit a wall, as it is expensive and technically we cannot upgrade the hardware beyond a certain level. You are then left with the option of optimizing your application, which might not be a very feasible approach for some systems that have been running in production for years.

On the other hand, horizontal scalability can be described as an approach where we say "Need more processing capabilities? Simple, just add more servers and multiply the processing capabilities". Theoretically this approach gives us unlimited processing power, but there are more challenges in practice. For many machines to work together, there would be a communication overhead between them, and the probability of any one of these machines being down at a given point of time is much higher.

MongoDB enables us to scale horizontally easily, and at the same time addresses the problems related to scaling horizontally to a great extent. The end result is that it is very easy to scale MongoDB with increasing data as compared to relational databases.

Ease of development

MongoDB doesn't have the concept of creation of schema as we have in relational databases. The document that we just saw can have an arbitrary structure when we store it in the database.
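As a hedged sketch of this flexibility, the following snippet uses the MongoDB Java driver of that era (assuming the 2.x API with MongoClient and BasicDBObject; the collection, field names, and values are purely illustrative) to store two documents with completely different shapes in the same collection, with no schema definition anywhere:

import com.mongodb.BasicDBObject;
import com.mongodb.DB;
import com.mongodb.DBCollection;
import com.mongodb.MongoClient;
import java.util.Arrays;

public class FlexibleSchemaExample {
    public static void main(String[] args) throws Exception {
        MongoClient client = new MongoClient("localhost", 27017);
        DB db = client.getDB("test");
        DBCollection people = db.getCollection("people");

        // A person with an embedded address sub-document
        BasicDBObject jack = new BasicDBObject("firstName", "Jack")
                .append("secondName", "Jones")
                .append("residentialAddress",
                        new BasicDBObject("city", "London").append("country", "UK"));

        // A person with a completely different set of fields; no schema change is needed
        BasicDBObject jill = new BasicDBObject("firstName", "Jill")
                .append("age", 28)
                .append("phoneNumbers", Arrays.asList("1234", "5678"));

        people.insert(jack);
        people.insert(jill);
        client.close();
    }
}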
This flexibility makes it very easy for us to model and store relatively unstructured/complex data, which becomes difficult to model in a relational database, for example, the product catalogue of an e-commerce application containing various items, each having different attributes. Also, it is more natural to use JSON in application development than tables from the relational world.

OK, it looks good, but what is the catch? Where not to use MongoDB?

To achieve the goal of letting MongoDB scale out easily, it had to do away with features like joins and multi-document/distributed transactions. Now, you might think it is pretty useless, as we have taken away two of the most important features of relational databases. However, mitigating the problem of joins is one of the reasons why MongoDB is document oriented. If you look at the preceding JSON document for the person, we have the address and the phone number as a part of the document. In a relational database, these would have been in separate tables and retrieved by joining these tables together. Distributed/multi-document transactions would inhibit MongoDB from scaling out and hence are not supported, nor is there a way to mitigate this. MongoDB is still atomic, but the atomicity for inserts and updates is guaranteed at the document level and not across multiple documents. Hence, MongoDB is not a good fit for scenarios where complex transactions are needed, such as OLTP banking applications. This is an area where the good old relational database still rules.

To conclude, let us take a look at the following image. This graph is pretty interesting and was presented by Dwight Merriman, Founder and CEO of 10gen, the MongoDB company, in one of his online courses. As we can see, on one side we have products like Memcached, which is very low on functionality but high on scalability and performance. On the other end we have RDBMS (Relational Database Management Systems), which are very rich in features but not that scalable. According to the research done while developing MongoDB, this graph is not linear, and there is a point on it after which scalability and performance fall steeply on adding more features to the product. MongoDB sits at this point, where it gives the maximum possible features without compromising too much on scalability and performance.

Summary

In this article, we saw the features displayed by MongoDB, what a document looks like, and how it is better than relational databases.

Resources for Article:

Further resources on this subject:
Building a Chat Application [Article]
Ruby with MongoDB for Web Development [Article]
Comparative Study of NoSQL Products [Article]