Data | 0 articles | Tech News, Tutorials & Expert Insights

06 Sep 2013

5 min read

Recursive queries

06 Sep 2013

(For more resources related to this topic, see here.) XML is one solution to dealing with hierarchical data, but it isn't the most natural for the relational database. Instead, you often wind up with nested categories, or filesystem-like folder hierarchies, or links back to older records. A popular way to structure a relational database for data with this shape is using a self reference, a link back to a parent object in the same table.For instance, you might want to model categories that can have subcategories with arbitrary nesting. A simple table might look like this: CREATE TABLE categories ( id SERIAL PRIMARY KEY, name VARCHAR, parent_id INTEGER REFERENCES categories); What makes this structure recursive is that self-referencing parent_id column, which refers to the table we're defining from within its own definition. We would treat categories with a NULL value for the parent_id column as top-level categories, categories that do not belong within any other category. To get a feel for the kind of data, let's put a few categories in there for an online retailer. Say we sell shirts and books, and books we further divide into fiction and non-fiction, and then we'll put programming books inside non-fiction. It might look like the following: INSERT INTO categories (id, name, parent_id) VALUES(1, 'Shirts', NULL),(2, 'Books', NULL),(3, 'Fiction', 2),(4, 'Non-fiction', 2),(5, 'Programming', 4); Usually you won't put these in manually like this, but you can see that Fiction and Non-fiction are children of the Books category because their parent_id values are 2, the value for Books is NULL and Programming has the parent_id value as 4, which is the id value for Non-fiction. Suppose you want to make navigation breadcrumbs given a category. You need to look up that category and then you need to look up the parent category. You need to do this until the parent category is NULL. In a procedural pseudocode, the process might look like the following: def write_breadcrumbs(category): row = getOne("SELECT * FROM categories WHERE id = ?", category) while row['parent_id'] != NULL: write(row['name']) write(' > ') row = getOne("SELECT * FROM categories WHERE id = ?", row['parent_id']) This kind of solution leads to the N+1 query problem. There's an action you want to take for a particular value, but to take that action you have to run an arbitrary number of separate queries. Recursive queries in PostgreSQL provide you with a way to have the database do the heavy lifting instead. Because we'll be using recursion in the SQL, let's first see what a recursion formulation would look like in our pseudocode: def write_breadcrumbs(category): row = getOne("SELECT * FROM categories WHERE id = ?", category) write(row['name']) if row['parent_id'] != NULL: write(' > ') write_breadcrumbs(row['parent_id']) It's debatable whether this is better code; it's shorter, and it has fewer bugs, but it also might expose the developer to the possibility of stack overflows. Recursive functions always have some similar structure though, some number of base cases that do not call the function recursively and some number of inductive cases that work by calling the same function again on slightly different data. The final destination will be one of the base cases. The inductive cases will peel off a small piece of the problem and solve it, then delegate the rest of the work to an invocation of the same function. PostgreSQL's recursive query support works with something called the common table expressions (CTEs). The idea is to make a named alias for a query. We won't delve into the details too much here, but all recursive queries will have the same basic structure: WITH RECURSIVE recursive-query-name AS ( SELECT <base-case> FROM table UNION ALL SELECT <inductive-case> FROM table JOIN <recursive-query-name> ON )SELECT * FROM <recursive-query-name>; For an example, let's get all the categories above Programming. The base case will be the Programming category itself. The inductive case will be to find the parent of a category we've already seen: WITH RECURSIVE programming_parents AS ( SELECT * FROM categories WHERE id = 5 UNION ALL SELECT categories.* FROM categories JOIN programming_parents ON programming_parents.parent_id = categories.id)SELECT * FROM programming_parents; This works as we'd hope. Output of a simple recursive query Without using this trick we'd have to do three separate queries to get this information, but with the trick it will always take one query no matter how deeply nested the categories are. We can also go in the other direction and build up something like a tree underneath a category by searching for categories that have a category we've already seen as a parent category. We can make the hierarchy more explicit by building up a path as we go: WITH RECURSIVE all_categories AS ( SELECT *, name as path FROM categories WHERE parent_id IS NULL UNION ALL SELECT c.*, p.path || '/' || c.name FROM categories AS c JOIN all_categories p ON p.id = c.parent_id)SELECT * FROM all_categories; Finding all the categories with their path We can be more discriminating with the search by looking for a particular category as the starting value in the base case: WITH RECURSIVE all_categories AS ( SELECT *, name as path FROM categories WHERE id = 2 UNION ALL SELECT c.*, p.path || '/' || c.name FROM categories AS c JOIN all_categories p ON p.id = c.parent_id)SELECT * FROM all_categories; The category tree rooted at Books This is hugely useful with hierarchical data. Summary In this article we learned about recursive queries as one of the most important features of PostgreSQL. Resources for Article: Further resources on this subject: Introduction to PostgreSQL 9 [Article] PostgreSQL: Tips and Tricks [Article] Obtaining a binary backup [Article]

0
0
5701

Packt

06 Sep 2013

4 min read

Regression models in Weka

Packt

06 Sep 2013

4 min read

(For more resources related to this topic, see here.) Getting ready Let's look at an example of a house price-based regression model, and create some real data to examine. These are actual numbers from houses for sale, and we will be trying to find the value of a house we are supposed to sell: Size (m2) Land (m2) Rooms Granite Extra bathroom Price 1076 2801 6 0 0 €324.500,00 990 3067 5 1 1 €466.000,00 1229 3094 5 0 1 €425.900,00 731 4315 4 1 0 €387.120,00 671 2926 4 0 1 €312.100,00 1078 6094 6 1 1 €603.000,00 909 2854 5 0 1 €383.400,00 975 2947 5 1 1 ?? To load files in Weka, we have to put the table in the ARFF file format and save it as house.arff. Make sure the attributes are numeric, as shown here: @RELATION house@ATTRIBUTE size NUMERIC@ATTRIBUTE land NUMERIC@ATTRIBUTE rooms NUMERIC@ATTRIBUTE granite NUMERIC@ATTRIBUTE extra_bathroom NUMERIC@ATTRIBUTE price NUMERIC@DATA1076,2801,6,0,0,324500990,3067,5,1,1,4660001229,3094,5,0,1,425900731,4315,4,1,0,387120671,2926,4,0,1,3121001078,6094,6,1,1,603000909,2854,5,0,1,383400975,2947,5,1,1,? How to do it… Use the following snippet: import java.io.BufferedReader;import java.io.FileReader;import weka.core.Instance;import weka.core.Instances;import weka.classifiers.functions.LinearRegression;public class Regression{public static void main(String args[]) throws Exception{//load dataInstances data = new Instances(new BufferedReader(newFileReader("dataset/house.arff")));data.setClassIndex(data.numAttributes() - 1);//build modelLinearRegression model = new LinearRegression();model.buildClassifier(data); //the last instance with missingclass is not usedSystem.out.println(model);//classify the last instanceInstance myHouse = data.lastInstance();double price = model.classifyInstance(myHouse);System.out.println("My house ("+myHouse+"): "+price);}} Here is the output: Linear Regression Modelprice =195.2035 * size +38.9694 * land +76218.4642 * granite +73947.2118 * extra_bathroom +2681.136My house (975,2947,5,1,1,?): 458013.16703945777 How it works… Import a basic regression model named weka.classifiers.functions.LinearRegression: import java.io.BufferedReader;import java.io.FileReader;import weka.core.Instance;import weka.core.Instances;import weka.classifiers.functions.LinearRegression; Load the house dataset: Instances data = new Instances(new BufferedReader(newFileReader("dataset/house.arff")));data.setClassIndex(data.numAttributes() - 1); Initialize and build a regression model. Note, that the last instance is not used for building the model since the class value is missing: LinearRegression model = new LinearRegression();model.buildClassifier(data); Output the model: System.out.println(model); Use the model to predict the price of the last instance in the dataset: Instance myHouse = data.lastInstance();double price = model.classifyInstance(myHouse);System.out.println("My house ("+myHouse+"): "+price); There’s more This section lists some additional algorithms. Other regression algorithms There is a wide variety of implemented regression algorithms one can use in Weka: weka.classifiers.rules.ZeroR: The class for building and using an 0-R classifier. Predicts the mean (for a numeric class) or the mode (for a nominal class) and it is considered as a baseline; that is, if your classifier's performance is worse than average value predictor, it is not worth considering it. weka.classifiers.trees.REPTree: The fast decision tree learner. Builds a decision/regression tree using information gain/variance and prunes it using reduced-error pruning (with backfitting). It only sorts values for numeric attributes once. Missing values are dealt with by splitting the corresponding instances into pieces (that is, as in C4.5). weka.classifiers.functions.SMOreg: SMOreg implements the support vector machine for regression. The parameters can be learned using various algorithms. The algorithm is selected by setting the RegOptimizer. The most popular algorithm (RegSMOImproved) is due to Shevade, Keerthi, and others, and this is the default RegOptimizer. weka.classifiers.functions.MultilayerPerceptron: A classifier that uses backpropagation to classify instances. This network can be built by hand, or created by an algorithm, or both. The network can also be monitored and modified during training time. The nodes in this network are all sigmoid (except for when the class is numeric in which case the output nodes become unthresholded linear units). weka.classifiers.functions.GaussianProcesses: Implements Gaussian Processes for regression without hyperparameter-tuning. Summary We learned how to use models that predict a value of numerical class, in contrast to classification, which predicts the value of a nominal class. Given a set of attributes, the regression builds a model, usually an equation that is used to compute the predicted class value. Resources for Article: Further resources on this subject: Java in Oracle Database [Article] Installing and Setting up JavaFX for NetBeans and Eclipse IDE [Article] Getting Started with JavaFX [Article]

0
0
10835

article-image-out-process-distributed-caching

Packt

06 Sep 2013

7 min read

Out-of-process distributed caching

Packt

06 Sep 2013

7 min read

(For more resources related to this topic, see here.) Getting ready Out-of-process caching is a way of distributing your caching needs in a different JVM and/or infrastructure. Ehcache provides a convenient deployable WAR file that works on most web containers/standalone servers whose mission is to provide an easy API interface to distributed cache. At the moment of writing, you can download it from http://sourceforge.net/projects/ehcache/files/ehcache-server/, or you can include it in your Maven POM and will be delivered as a WAR file. The cache server requires no special configuration on the Tomcat container. However, if you are running GlassFish, Jetty, WebLogic, or any other application server (or servlet container), there are minimal configuration changes to do. Please refer to the Ehcache cache server documentation for details on these. While using the RESTful interface, it is important to note that you have three ways to set the MIME type for exchanging data back and forth to the cache server, namely Text/XML, application/JSON, and application/x-java-serialized-object. You can use any programming language to invoke the web service interface and cache your objects (except for application/x-java-serialized-object for obvious reasons). Refer to the recipe8 project directory within the source code bundle for a fully working sample of this recipe content and further information related to this topic. How to do it... Add Ehcache and Ehcache cache server dependencies to your POM.xml file. <dependency> <groupId>net.sf.ehcache</groupId> <artifactId>ehcache-server</artifactId> <version>1.0.0</version> <type>war</type> </dependency> <dependency> <groupId>net.sf.ehcache</groupId> <artifactId>ehcache</artifactId> <version>2.6.0</version> <type>pom</type> </dependency> Edit ehcache.xml in the cache server to hold your cache setup (the cache name is very important).You can find this file here: ${CACHE_SERVER}/WEB-INF/classes/ehcache.xml. <?xml version="1.0" encoding="UTF-8"?> <ehcache xsi_noNamespaceSchemaLocation="ehcache.xsd" updateCheck="true" monitoring="autodetect" dynamicConfig="true">  <cache name="remoteCache" maxElementsInMemory="10000" eternal="true" diskPersistent="true" overflowToDisk="true"/> ... Disable the SOAP interface in the cache server web.xml (since we are going to use RESTful) file:You can find this file here: ${CACHE_SERVER}/WEB-INF/web.xml. <?xml version="1.0" encoding="UTF-8"?> <web-app xsi_schemaLocation="http://java.sun.com/xml/ns/javaee http://java.sun.com/xml/ns/javaee/web-app_2_5.xsd" version="2.5"> ...  ... Make your objects-to-be-cached serializable: import java.io.Serializable; public final class Item implements Serializable { Invoke the RESTful (or SOAP) interface to save/retrieve/delete cached objects: ... public void saveItemInCache(String key, Serializable item) { //sample URL: http://localhost:8080/ehcache/rest/cacheName/{id} //here cacheName is the cache name you set up in the cache-server ehcache.xml String url = CACHE_SERVER_URL + "cacheName" + "/" + key; //initialize Apache HTTP Client client = new DefaultHttpClient(); //create Cache Element to be sent Element element = new Element(key, item); //serialize object to be sent to EhCache Server byte[] itemToByteArray = SerializationUtils.serialize(element); //create PUT request HttpPut putRequest = new HttpPut(url); //set header to read java-serialized-object mime type putRequest.setHeader ("Content-Type", "application/x-java-serialized-object"); ... How it works... The Ehcache cache server utility is a versatile tool that lets us distribute cache engines in a very flexible way. It provides a very simple API exposure via RESTful or SOAP-based web services. We start by editing the ehcache.xml configuration file within the cache server application by adding a cache that we would like to use for our cached objects: ...  <cache name="remoteCache" maxElementsInMemory="10000" eternal="true" diskPersistent="true" overflowToDisk="true"/> ... The cache name defined here is very important because this will be the endpoint of our RESTful URL pattern that the cache server will identify and use. Then, we need to edit the web.xml file within the cache server application (located in {CACHE-SERVER}/WEB-INF/) in order to comment out (or completely remove) service definitions that we are not going to use (that is, SOAP if you are using RESTful or vice versa).  <cache name="remoteCache" maxElementsInMemory="10000" ... Now, your URL would be something like this: //sample URL: http://localhost:8080/ehcache/rest/remoteCache/{id} Here, id is just the key value you assign to the cached object. Finally, you just use any http/SOAP client library (or Java default Net API classes) to invoke the web service. In the case of RESTful services, you need to be aware that the HTTP method sent determines whether you are storing, updating, retrieving, or deleting a cached item. They are as follows: GET /{cache}/{element}: This retrieves an object by its key from the O-O-P cache layer. PUT /{cache}/{element}: This stores an item in the O-O-P cache layer. DELETE /{cache}/{element}: This deletes an item from the O-O-P cache layer. HEAD /{cache}/{element}: This retrieves metadata (cache configuration values) from the O-O-P cache layer. OPTIONS /{cache}/{element}: This returns the WADL describing operations. For changing the context you can edit the file ${CACHE_SERVER}/META-INF/context.xml and place your desired context name. As for security, look for the file ${CACHE_SERVER}/WEB-INF/server_security_config.xml_rename_to_activate and open it to read the instructions. Summary This article provided details on implementing distributed caching using the Ehcache server, and also explained in brief what out-of-process caching is. Resources for Article : Further resources on this subject: PostgreSQL 9: Reliable Controller and Disk Setup [Article] Play Framework: Data Validation Using Controllers [Article] Building Applications with Spring Data Redis [Article]

0
0
2130

article-image-using-indexes-manipulate-pandas-objects

Packt

05 Sep 2013

4 min read

Using indexes to manipulate pandas objects

Packt

05 Sep 2013

4 min read

(For more resources related to this topic, see here.) Getting ready A good understanding of indexes in pandas is crucial to quickly move the data around. From a business intelligence perspective, they create a distinction similar to that of metrics and dimensions in an OLAP cube. To illustrate this point, this recipe walks through getting stock data out of pandas, combining it, then reindexing it for easy chomping. How to do it... Use the DataReader object to transfer stock price information into a DataFrame and to explore the basic axis of Panel. > from pandas.i git push -u origin master o.data import DataReader > tickers = ['gs', 'ibm', 'f', 'ba', 'axp'] > dfs = {} > for ticker in tickers: dfs[ticker] = DataReader(ticker, "yahoo", '2006-01-01') # a yet undiscussed data structure, in the same way the a # DataFrame is a collection of Series, a Panel is a collection of # DataFrames > pan = pd.Panel(dfs) > pan <class 'pandas.core.panel.Panel'> Dimensions: 5 (items) x 1764 (major_axis) x 6 (minor_axis)Items axis: axp to ibm Major_axis axis: 2006-01-03 00:00:00 to 2013-01-04 00:00:00 Minor_axis axis: Open to Adj Close > pan.items Index([axp, ba, f, gs, ibm], dtype=object) > pan.minor_axis Index([Open, High, Low, Close, Volume, Adj Close], dtype=object) > pan.major_axis <class 'pandas.tseries.index.DatetimeIndex'>[2006-01-03 00:00:00, ..., 2013-01-04 00:00:00] Length: 1764, Freq: None, Timezone: None Use the axis selectors to easily compute different sets of summary statistics. > pan.minor_xs('Open').mean() axp 46.227466 ba 70.746451 f 9.135794 gs 151.655091 ibm 129.570969 # major axis is sliceable as well > day_slice = pan.major_axis[1] > pan.major_xs(day_slice)[['gs', 'ba']] ba gs Open 70.08 127.35 High 71.27 128.91 Low 69.86 126.38 Close 71.17 127.09 Volume 3165000.00 4861600.00 Adj Close 60.43 118.12 Convert the Panel to a DataFrame. > dfs = [] > for df in pan: idx = pan.major_axis idx = pd.MultiIndex.from_tuples(zip([df]*len(idx), idx)) idx.names = ['ticker', 'timestamp'] dfs.append(pd.DataFrame(pan[df].values, index=idx, columns=pan.minor_axis)) > df = pd.concat(dfs) > df Data columns: Open 8820 non-null values High 8820 non-null values Low 8820 non-null values Close 8820 non-null values Volume 8820 non-null values Adj Close 8820 non-null values dtypes: float64(6) Perform the analogous operations as in the preceding examples on the newly created DataFrame. # selecting from a MultiIndex isn't much different than the Panel # (output muted) > df.ix['gs':'ibm'] > df['Open'] How it works... The previous example was certainly contrived, but when indexing and statistical techniques are incorporated, the power of pandas begins to come through. Statistics will be covered in an upcoming recipe. pandas' indexes by themselves can be thought of as descriptors of a certain point in the DataFrame. When ticker and timestamp are the only indexes in a DataFrame, then the point is individualized by the ticker, timestamp, and column name. After the point is individualized, it's more convenient for aggregation and analysis. There's more... Indexes show up all over the place in pandas so it's worthwhile to see some other use cases as well. Advanced header indexes Hierarchical indexing isn't limited to rows. Headers can also be represented by MultiIndex, as shown in the following command line: > header_top = ['Price', 'Price', 'Price', 'Price', 'Volume', 'Price'] > df.columns = pd.MultiIndex.from_tuples(zip(header_top, df.columns) Performing aggregate operations with indexes As a prelude to the following sections, we'll do a single groupby function here since they work with indexes so well. > df.groupby(level=['tickers', 'day'])['Volume'].mean() This answers the question for each ticker and for each day (not date), that is, what was the mean volume over the life of the data. Summary This article talks about the use and importance of indexes in pandas. It also talks about different operations that can be done with indexes. Resources for Article : Further resources on this subject: Installing Panda3D [Article] Setting Up Panda3D and Configuring Development Tools [Article] Collision Detection and Physics in Panda3D Game Development [Article]

0
0
3109

Packt

04 Sep 2013

7 min read

Quick start – writing your first MDX query

Packt

04 Sep 2013

7 min read

(For more resources related to this topic, see here.) Step 1 – open the SQL Server Management Studio and connect to the cube The Microsoft SQL Server Management Studio (SSMS) is a client application used by the administrators to manage instances and by developers to create object and write queries. We will use SSMS to connect on the cube and write our first MDX query. Here's a screenshot of SSMS with a connection on a SSAS server: Click on the Windows button, click on All Programs, click on Microsoft SQL Server 2012, and then click on SQL Server Management Studio. In the Connect to Server window, in the Server type box, select Analysis Services. In the Server name box, type the name of your Analysis Services server. Click on Connect. In the SQL Server Management Studio window, click on the File menu, click on New, and then click on Analysis Services MDX Query. In the Connect to Analysis Services window, in the Server name box, type the name of you Analysis Services server, and then click on Connect. SELECT FROM WHERE If you have already written SQL queries, you might have already made connections with the T-SQL language. Here's my tip for you: don't, you will only hurt yourself. Some words are the same, but it is better to think MDX when writing MDX rather than to think SQL when writing MDX. Step 2 – SELECT The SELECT clause is the main part of the MDX query. You will define what are the measure and dimension members that you want to display. You also have to define on which axis of your result set you want to display the measure and dimension members. Axes Axes are the columns and rows of the result set. With SQL Server Analysis Services, upto 128 axes can be specified. The axes have a number which is zero-based. The first axe is 0, the second on is 1, and so on. So, if you want to use two axes, the first one will be 0 and the second will be 1. You cannot use axe 0 and axe 2, if you don't define axe 1. For the first five axes, you can use the axis alias instead. After the axe 4, you will have to revert to the number because no other aliases are available. Axe Number Alias 0 Columns 1 Rows 2 Pages 3 Sections 4 Chapters Even if SSAS supports 128 axes, if you try to use more than two axes in SSMS in your query, you will get this error when you execute your MDX query: Results cannot be displayed for cellsets with more than two axes. So, always write your MDX queries using only two axes in SSMS and separate them with a comma. Tuples A tuple is a specific point in the cube where dimensions meet. A tuple can contain one or more members from the cube's dimensions, but you cannot have two members from the same dimension. If you want to display only the calendar year 2008, you will have to write [Date].[CY 2008]. If you want to have more than one dimension, you have to enclose them using parenthesis () and separate them with a comma. Calendar year for United States will look like ([Date].[CY 2008], [Geography].[United States]). Even if you are writing a tuple with only a single member from a single dimension, it is good practice to enclose it in parenthesis. Sets If you want to display the year 2005 to 2008, you will write four single-dimension tuples which composes a set. When writing the set, you separate the tuples with commas and wrap it all with curly braces {} and separate the tuples with commas such as {[Date].[CY 2005], [Date].[CY 2006] , [Date].[CY 2007] , [Date].[CY 2008]} to have the calendar years from 2005 to 2008. Since all the tuples are from the same dimension, you can also write it using a colon (:), such as {[Date].[CY 2005]: [Date].[CY 2008]} which will give you the years 2005 to 2008. With SSAS 2012, you can write {[Date].[CY 2008]: [Date].[CY 2005]} and the result will still be from 2005 to 2008. What about the calendar year 2008 for both Canada and the United States? You will write two tuples. A set can be composed of one or more tuples. The tuples must have the same dimensionality; otherwise, an error will occur. Meaning that the first member is from the Date dimension and the second from the Geography dimension. You cannot have the first tuple with Date-Geography and the second being Geography-Date; you will encounter an error. So the calendar year 2008 with Canada and United States will look such as {([Date].[CY 2008], [Geography].[Canada]), ([Date].[CY 2008], [Geography].[United States])}. When writing tuples, always use the form [Dimension].[Level].[MemberName]. So, [Geography].[Canada] should be written as [Geography].[Country].[Canada]. You could also use the member key instead of the member name. In SSAS, use the ampersand (&) when using the key; [Geography].[State-Province].[Quebec] with the name becomes [Geography].[State-Province].&[QC]&[CA] using the keys. What happens when you want to write bigger sets such as for the bikes and components product category in Canada and the United States from 2005 to 2008? Enter the Crossjoin function. Crossjoin takes two or more sets for arguments and returns you a set with the cross products or the specified sets. Crossjoin ({[Product].[Category].[Bikes], [Product].[Category].[Components]}, {[Geography].[Country].[Canada], [Geography].[Country].[United States]}, {[Date].[CY 2005] : [Date].[CY 2008]}) The MDX queries can be written using line-break to add visibility to the code. So each time we write a new set and even tuples, we write it on a new line and add some indentation: Crossjoin ({[Product].[Category].[Bikes], [Product].[Category].[Components]},{[Geography].[Country].[Canada], [Geography].[Country].[United States]}, {[Date].[CY 2005] : [Date].[CY 2008]}) Step 3 – FROM The FROM clause defines where the query will get the data. It can be one of the following four things: A cube. A perspective (a subset of dimensions and measures). A subcube (a MDX query inside a MDX query). A dimension (a dimension inside your SSAS database, you must use the dollar sign ($) before the name of the dimension). Step 4 – WHERE The WHERE clause is used to filter the dimensions and members out of the MDX query. The set used in the WHERE clause won't be displayed in your result set. Step 5 – comments Comment your code. You never know when somebody else will take a look on your queries and trying to understand what has been written could be harsh. There are three ways to use delimit comments inside the query: /* and */ // -- (pair of dashes) The /* and */ symbols can be used to comment multiple lines of text in your query. Everything between the /* and the */ symbols will be ignored when the MDX query is parsed. Use // or -- to begin a comment on a single line. Step 6 – your first MDX query So if you want to display the Resellers Sales Amount and Reseller Order Quantity measures on the columns, the years from 2006 to 2008 with the bikes and components product categories for Canada. First, identify what will go where. Start with the two axes, continue with the FROM clause, and finish with the WHERE clause. SELECT{[Measures].[Reseller Sales Amount], [Measures].[Reseller Order Quantity]} on columns,Crossjoin({[Date].[CY 2006] : [Date].[CY 2008]}, {[Product].[Category].[Bikes], [Product].[Category].[Components]}) on rowsFROM [Adventure Works]WHERE {[Geography].[Country].[Canada]} This query will return the following result set: Reseller Sales Amount Reseller Order Quantity CY 2006 Bikes $3,938,283.99 4,563 CY 2006 Components $746,576.15 2,954 CY 2007 Bikes $4,417,665.71 5,395 CY 2007 Components $997,617.89 4,412 CY 2008 Bikes $1,909,709.62 2,209 CY 2008 Components $370,698.68 1,672 Summary In this article, we saw how to write the MDX queries in various steps. We used the FROM, WHERE, and SELECT clauses in writing the queries. This article was a quick start guide for starting to query and it will help you write more complex queries. Happy querying! Resources for Article : Further resources on this subject: Connecting to Microsoft SQL Server Compact 3.5 with Visual Studio [Article] MySQL Linked Server on SQL Server 2008 [Article] Microsoft SQL Azure Tools [Article]

0
0
9984

article-image-understanding-point-time-recovery

Packt

04 Sep 2013

28 min read

Understanding Point-In-Time-Recovery

Packt

04 Sep 2013

28 min read

0
0
5261

article-image-integrating-storm-and-hadoop

Packt

04 Sep 2013

17 min read

Integrating Storm and Hadoop

Packt

04 Sep 2013

17 min read

(For more resources related to this topic, see here.) In this article, we will implement the Batch and Service layers to complete the architecture. There are some key concepts underlying this big data architecture: Immutable state Abstraction and composition Constrain complexity Immutable state is the key, in that it provides true fault-tolerance for the architecture. If a failure is experienced at any level, we can always rebuild the data from the original immutable data. This is in contrast to many existing data systems, where the paradigm is to act on mutable data. This approach may seem simple and logical; however, it exposes the system to a particular kind of risk in which the state is lost or corrupted. It also constrains the system, in that you can only work with the current view of the data; it isn't possible to derive new views of the data. When the architecture is based on a fundamentally immutable state, it becomes both flexible and fault-tolerant. Abstractions allow us to remove complexity in some cases, and in others they can introduce complexity. It is important to achieve an appropriate set of abstractions that increase our productivity and remove complexity, but at an appropriate cost. It must be noted that all abstractions leak, meaning that when failures occur at a lower abstraction, they will affect the higher-level abstractions. It is therefore often important to be able to make changes within the various layers and understand more than one layer of abstraction. The designs we choose to implement our abstractions must therefore not prevent us from reasoning about or working at the lower levels of abstraction when required. Open source projects are often good at this, because of the obvious access to the code of the lower level abstractions, but even with source code available, it is easy to convolute the abstraction to the extent that it becomes a risk. In a big data solution, we have to work at higher levels of abstraction in order to be productive and deal with the massive complexity, so we need to choose our abstractions carefully. In the case of Storm, Trident represents an appropriate abstraction for dealing with the data-processing complexity, but the lower level Storm API on which Trident is based isn't hidden from us. We are therefore able to easily reason about Trident based on an understanding of lower-level abstractions within Storm. Another key issue to consider when dealing with complexity and productivity is composition. Composition within a given layer of abstraction allows us to quickly build out a solution that is well tested and easy to reason about. Composition is fundamentally decoupled, while abstraction contains some inherent coupling to the lower-level abstractions—something that we need to be aware of. Finally, a big data solution needs to constrain complexity. Complexity always equates to risk and cost in the long run, both from a development perspective and from an operational perspective. Real-time solutions will always be more complex than batch-based systems; they also lack some of the qualities we require in terms of performance. Nathan Marz's Lambda architecture attempts to address this by combining the qualities of each type of system to constrain complexity and deliver a truly fault-tolerant architecture. We divided this flow into preprocessing and "at time" phases, using streams and DRPC streams respectively. We also introduced time windows that allowed us to segment the preprocessed data. In this article, we complete the entire architecture by implementing the Batch and Service layers. The Service layer is simply a store of a view of the data. In this case, we will store this view in Cassandra, as it is a convenient place to access the state alongside Trident's state. The preprocessed view is identical to the preprocessed view created by Trident, counted elements of the TF-IDF formula (D, DF, and TF), but in the batch case, the dataset is much larger, as it includes the entire history. The Batch layer is implemented in Hadoop using MapReduce to calculate the preprocessed view of the data. MapReduce is extremely powerful, but like the lower-level Storm API, is potentially too low-level for the problem at hand for the following reasons: We need to describe the problem as a data pipeline; MapReduce isn't congruent with such a way of thinking Productivity We would like to think of a data pipeline in terms of streams of data, tuples within the stream and predicates acting on those tuples. This allows us to easily describe a solution to a data processing problem, but it also promotes composability, in that predicates are fundamentally composable, but pipelines themselves can also be composed to form larger, more complex pipelines. Cascading provides such an abstraction for MapReduce in the same way as Trident does for Storm. With these tools, approaches, and considerations in place, we can now complete our real-time big data architecture. There are a number of elements, that we will update, and a number of elements that we will add. The following figure illustrates the final architecture, where the elements in light grey will be updated from the existing recipe, and the elements in dark grey will be added in this article: Implementing TF-IDF in Hadoop TF-IDF is a well-known problem in the MapReduce communities; it is well-documented and implemented, and it is interesting in that it is sufficiently complex to be useful and instructive at the same time. Cascading has a series of tutorials on TF-IDF at http://www.cascading.org/2012/07/31/cascading-for-the-impatient-part-5/, which documents this implementation well. For this recipe, we shall use a Clojure Domain Specific Language (DSL) called Cascalog that is implemented on top of Cascading. Cascalog has been chosen because it provides a set of abstractions that are very semantically similar to the Trident API and are very terse while still remaining very readable and easy to understand. Getting ready Before you begin, please ensure that you have installed Hadoop by following the instructions at http://www.michael-noll.com/tutorials/running-hadoop-on-ubuntu-linux-single-node-cluster/. How to do it… Start by creating the project using the lein command: lein new tfidf-cascalog Next, you need to edit the project.clj file to include the dependencies: (defproject tfidf-cascalog "0.1.0-SNAPSHOT" :dependencies [[org.clojure/clojure "1.4.0"] [cascalog "1.10.1"] [org.apache.cassandra/cassandra-all "1.1.5"] [clojurewerkz/cassaforte "1.0.0-beta11-SNAPSHOT"] [quintona/cascading-cassandra "0.0.7-SNAPSHOT"] [clj-time "0.5.0"] [cascading.avro/avro-scheme "2.2-SNAPSHOT"] [cascalog-more-taps "0.3.0"] [org.apache.httpcomponents/httpclient "4.2.3"]] :profiles{:dev{:dependencies[[org.apache.hadoop/hadoop-core "0.20.2-dev"] [lein-midje "3.0.1"] [cascalog/midje-cascalog "1.10.1"]]}}) It is always a good idea to validate your dependencies; to do this, execute lein deps and review any errors. In this particular case, cascading-cassandra has not been deployed to clojars, and so you will receive an error message. Simply download the source from https://github.com/quintona/cascading-cassandra and install it into your local repository using Maven. It is also good practice to understand your dependency tree. This is important to not only prevent duplicate classpath issues, but also to understand what licenses you are subject to. To do this, simply run lein pom, followed by mvn dependency:tree. You can then review the tree for conflicts. In this particular case, you will notice that there are two conflicting versions of Avro. You can fix this by adding the appropriate exclusions: [org.apache.cassandra/cassandra-all "1.1.5" :exclusions [org.apache.cassandra.deps/avro]] We then need to create the Clojure-based Cascade queries that will process the document data. We first need to create the query that will create the "D" view of the data; that is, the D portion of the TF-IDF function. This is achieved by defining a Cascalog function that will output a key and a value, which is composed of a set of predicates: (defn D [src] (let [src (select-fields src ["?doc-id"])] (<- [?key ?d-str] (src ?doc-id) (c/distinct-count ?doc-id :> ?n-docs) (str "twitter" :> ?key) (str ?n-docs :> ?d-str)))) You can define this and any of the following functions in the REPL, or add them to core.clj in your project. If you want to use the REPL, simply use lein repl from within the project folder. The required namespace (the use statement), require, and import definitions can be found in the source code bundle. We then need to add similar functions to calculate the TF and DF values: (defn DF [src] (<- [?key ?df-count-str] (src ?doc-id ?time ?df-word) (c/distinct-count ?doc-id ?df-word :> ?df-count) (str ?df-word :> ?key) (str ?df-count :> ?df-count-str))) (defn TF [src] (<- [?key ?tf-count-str] (src ?doc-id ?time ?tf-word) (c/count ?tf-count) (str ?doc-id ?tf-word :> ?key) (str ?tf-count :> ?tf-count-str))) This Batch layer is only interested in calculating views for all the data leading up to, but not including, the current hour. This is because the data for the current hour will be provided by Trident when it merges this batch view with the view it has calculated. In order to achieve this, we need to filter out all the records that are within the current hour. The following function makes that possible: (deffilterop timing-correct? [doc-time] (let [now (local-now) interval (in-minutes (interval (from-long doc-time) now))] (if (< interval 60) false true)) Each of the preceding query definitions require a clean stream of words. The text contained in the source documents isn't clean. It still contains stop words. In order to filter these and emit a clean set of words for these queries, we can compose a function that splits the text into words and filters them based on a list of stop words and the time function defined previously: (defn etl-docs-gen [rain stop] (<- [?doc-id ?time ?word] (rain ?doc-id ?time ?line) (split ?line :> ?word-dirty) ((c/comp s/trim s/lower-case) ?word-dirty :> ?word) (stop ?word :> false) (timing-correct? ?time))) We will be storing the outputs from our queries to Cassandra, which requires us to define a set of taps for these views: (defn create-tap [rowkey cassandra-ip] (let [keyspace storm_keyspace column-family "tfidfbatch" scheme (CassandraScheme. cassandra-ip "9160" keyspace column-family rowkey {"cassandra.inputPartitioner""org.apache.cassandra.dht.RandomPartitioner" "cassandra.outputPartitioner" "org.apache.cassandra.dht.RandomPartitioner"}) tap (CassandraTap. scheme)] tap)) (defn create-d-tap [cassandra-ip] (create-tap "d"cassandra-ip)) (defn create-df-tap [cassandra-ip] (create-tap "df" cassandra-ip)) (defn create-tf-tap [cassandra-ip] (create-tap "tf" cassandra-ip)) The way this schema is created means that it will use a static row key and persist name-value pairs from the tuples as column:value within that row. This is congruent with the approach used by the Trident Cassandra adaptor. This is a convenient approach, as it will make our lives easier later. We can complete the implementation by a providing a function that ties everything together and executes the queries: (defn execute [in stop cassandra-ip] (cc/connect! cassandra-ip) (sch/set-keyspace storm_keyspace) (let [input (tap/hfs-tap (AvroScheme. (load-schema)) in) stop (hfs-delimited stop :skip-header? true) src (etl-docs-gen input stop)] (?- (create-d-tap cassandra-ip) (D src)) (?- (create-df-tap cassandra-ip) (DF src)) (?- (create-tf-tap cassandra-ip) (TF src)))) Next, we need to get some data to test with. I have created some test data, which is available at https://bitbucket.org/qanderson/tfidf-cascalog. Simply download the project and copy the contents of src/data to the data folder in your project structure. We can now test this entire implementation. To do this, we need to insert the data into Hadoop: hadoop fs -copyFromLocal ./data/document.avro data/document.avro hadoop fs -copyFromLocal ./data/en.stop data/en.stop Then launch the execution from the REPL: => (execute "data/document" "data/en.stop" "127.0.0.1") How it works… There are many excellent guides on the Cascalog wiki (https://github.com/nathanmarz/cascalog/wiki), but for completeness's sake, the nature of a Cascalog query will be explained here. Before that, however, a revision of Cascading pipelines is required. The following is quoted from the Cascading documentation (http://docs.cascading.org/cascading/2.1/userguide/htmlsingle/): Pipe assemblies define what work should be done against tuple streams, which are read from tap sources and written to tap sinks. The work performed on the data stream may include actions such as filtering, transforming, organizing, and calculating. Pipe assemblies may use multiple sources and multiple sinks, and may define splits, merges, and joins to manipulate the tuple streams. This concept is embodied in Cascalog through the definition of queries. A query takes a set of inputs and applies a list of predicates across the fields in each tuple of the input stream. Queries are composed through the application of many predicates. Queries can also be composed to form larger, more complex queries. In either event, these queries are reduced down into a Cascading pipeline. Cascalog therefore provides an extremely terse and powerful abstraction on top of Cascading; moreover, it enables an excellent development workflow through the REPL. Queries can be easily composed and executed against smaller representative datasets within the REPL, providing the idiomatic API and development workflow that makes Clojure beautiful. If we unpack the query we defined for TF, we will find the following code: (defn DF [src] (<- [?key ?df-count-str] (src ?doc-id ?time ?df-word) (c/distinct-count ?doc-id ?df-word :> ?df-count) (str ?df-word :> ?key) (str ?df-count :> ?df-count-str))) The <- macro defines a query, but does not execute it. The initial vector, [?key ?df-count-str], defines the output fields, which is followed by a list of predicate functions. Each predicate can be one of the following three types: Generators: A source of data where the underlying source is either a tap or another query. Operations: Implicit relations that take in input variables defined elsewhere and either act as a function that binds new variables or a filter. Operations typically act within the scope of a single tuple. Aggregators: Functions that act across tuples to create aggregate representations of data. For example, count and sum. The :> keyword is used to separate input variables from output variables. If no :> keyword is specified, the variables are considered as input variables for operations and output variables for generators and aggregators. The (src ?doc-id ?time ?df-word) predicate function names the first three values within the input tuple, whose names are applicable within the query scope. Therefore, if the tuple ("doc1" 123324 "This") arrives in this query, the variables would effectively bind as follows: ?doc-id: "doc1" ?time: 123324 ?df-word: "This" Each predicate within the scope of the query can use any bound value or add new bound variables to the scope of the query. The final set of bound values that are emitted is defined by the output vector. We defined three queries, each calculating a portion of the value required for the TF-IDF algorithm. These are fed from two single taps, which are files stored in the Hadoop filesystem. The document file is stored using Apache Avro, which provides a high-performance and dynamic serialization layer. Avro takes a record definition and enables serialization/deserialization based on it. The record structure, in this case, is for a document and is defined as follows: {"namespace": "storm.cookbook", "type": "record", "name": "Document", "fields": [ {"name": "docid", "type": "string"}, {"name": "time", "type": "long"}, {"name": "line", "type": "string"} ] } Both the stop words and documents are fed through an ETL function that emits a clean set of words that have been filtered. The words are derived by splitting the line field using a regular expression: (defmapcatop split [line] (s/split line #"[[](),.)s]+")) The ETL function is also a query, which serves as a source for our downstream queries, and defines the [?doc-id ?time ?word] output fields. The output tap, or sink, is based on the Cassandra scheme. A query defines predicate logic, not the source and destination of data. The sink ensures that the outputs of our queries are sent to Cassandra. The ?- macro executes a query, and it is only at execution time that a query is bound to its source and destination, again allowing for extreme levels of composition. The following, therefore, executes the TF query and outputs to Cassandra: (?- (create-tf-tap cassandra-ip) (TF src)) There's more… The Avro test data was created using the test data from the Cascading tutorial at http://www.cascading.org/2012/07/31/cascading-for-the-impatient-part-5/. Within this tutorial is the rain.txt tab-separated data file. A new column was created called time that holds the Unix epoc time in milliseconds. The updated text file was then processed using some basic Java code that leverages Avro: Schema schema = Schema.parse(SandboxMain.class.getResourceAsStream("/document.avsc")); File file = new File("document.avro"); DatumWriter<GenericRecord> datumWriter = new GenericDatumWriter<GenericRecord>(schema); DataFileWriter<GenericRecord> dataFileWriter = new DataFileWriter<GenericRecord>(datumWriter); dataFileWriter.create(schema, file); BufferedReader reader = new BufferedReader(new InputStreamReader(SandboxMain.class.getResourceAsStream("/rain.txt"))); String line = null; try { while ((line = reader.readLine()) != null) { String[] tokens = line.split("t"); GenericRecord docEntry = new GenericData.Record(schema); docEntry.put("docid", tokens[0]); docEntry.put("time", Long.parseLong(tokens[1])); docEntry.put("line", tokens[2]); dataFileWriter.append(docEntry); } } catch (IOException e) { e.printStackTrace(); } dataFileWriter.close(); Persisting documents from Storm In the previous recipe, we looked at deriving precomputed views of our data taking some immutable data as the source. In that recipe, we used statically created data. In an operational system, we need Storm to store the immutable data into Hadoop so that it can be used in any preprocessing that is required. How to do it… As each tuple is processed in Storm, we must generate an Avro record based on the document record definition and append it to the data file within the Hadoop filesystem. We must create a Trident function that takes each document tuple and stores the associated Avro record. Within the tfidf-topology project created in, inside the storm.cookbook.tfidf.function package, create a new class named PersistDocumentFunction that extends BaseFunction. Within the prepare function, initialize the Avro schema and document writer: public void prepare(Map conf, TridentOperationContext context) { try { String path = (String) conf.get("DOCUMENT_PATH"); schema = Schema.parse(PersistDocumentFunction.class .getResourceAsStream("/document.avsc")); File file = new File(path); DatumWriter<GenericRecord> datumWriter = new GenericDatumWriter<GenericRecord>(schema); dataFileWriter = new DataFileWriter<GenericRecord>(datumWriter); if(file.exists()) dataFileWriter.appendTo(file); else dataFileWriter.create(schema, file); } catch (IOException e) { throw new RuntimeException(e); } } As each tuple is received, coerce it into an Avro record and add it to the file: public void execute(TridentTuple tuple, TridentCollector collector) { GenericRecord docEntry = new GenericData.Record(schema); docEntry.put("docid", tuple.getStringByField("documentId")); docEntry.put("time", Time.currentTimeMillis()); docEntry.put("line", tuple.getStringByField("document")); try { dataFileWriter.append(docEntry); dataFileWriter.flush(); } catch (IOException e) { LOG.error("Error writing to document record: " + e); throw new RuntimeException(e); } } Next, edit the TermTopology.build topology and add the function to the document stream: documentStream.each(new Fields("documentId","document"), new PersistDocumentFunction(), new Fields()); Finally, include the document path into the topology configuration: conf.put("DOCUMENT_PATH", "document.avro"); How it works… There are various logical streams within the topology, and certainly the input for the topology is not in the appropriate state for the recipes in this article containing only URLs. We therefore need to select the correct stream from which to consume tuples, coerce these into Avro records, and serialize them into a file. The previous recipe will then periodically consume this file. Within the context of the topology definition, include the following code: Stream documentStream = getUrlStream(topology, spout) .each(new Fields("url"), new DocumentFetchFunction(mimeTypes), new Fields("document", "documentId", "source")); documentStream.each(new Fields("documentId","document"), new PersistDocumentFunction(), new Fields()); The function should consume tuples from the document stream whose tuples are populated with already fetched documents.

0
0
2862

article-image-sql-server-integration-services-ssis

Packt

03 Sep 2013

5 min read

SQL Server Integration Services (SSIS)

Packt

03 Sep 2013

5 min read

0
0
6962

Packt

03 Sep 2013

20 min read

Oracle APEX 4.2 reporting

Packt

03 Sep 2013

20 min read

(For more resources related to this topic, see here.) The objective of the first chapter is to quickly introduce you to the technology and then dive deep into the understanding the fabric of the tool. The chapter also helps you set the environment, which will be used throughout the book. Chapter 1, Know Your Horse Before You Ride It, starts with discussing the various features of APEX. This is to give heads up to the readers about the features offered by the tool and to inform them about some of the strengths of the tool. In order to understand the technology better, we discuss the various web server combinations possible with APEX, namely the Internal mod_plsql, External mod_plsql, and Listener configuration. While talking about the Internal mod_plsql configuration, we see the steps to enable the XMLDB HTTP server. In the Internal mod_plsql configuration, Oracle uses a DAD defined in EPG to talk to the database and the web server. So, we try to create a miniature APEX of our own, by creating our DAD using it to talk to the database and the web server. We then move on to learn about the External mod_plsql configuration. We discuss the architecture and the roles of the configuration files, such as dads.conf and httpd.conf. We also have a look at a typical dads.conf file and draw correlations between the configurations in the Internal and External mod_plsql configuration. We then move on to talk about the wwv_flow_epg_include_mod_local procedure that can help us use the DAD of APEX to call our own stored PL/SQL procedures. We then move on to talk about APEX Listener which is a JEE alternative to mod_plsql and is Oracle’s direction for the future. Once we are through with understanding the possible configurations, we see the steps to set up our environment. We use the APEX Listener configuration and see the steps to install the APEX engine, create a Weblogic domain, set the listener in the domain, and create an APEX workspace. With our environment in place, we straight away get into understanding the anatomy of APEX by analyzing various parts of its URL. This discussion includes a natter on the session management, request handling, debugging, error handling, use of TKPROF for tracing an APEX page execution, cache management and navigation, and value passing in APEX. We also try to understand the design behind the zero session ID in this section. Our discussions till now would have given you a brief idea about the technology, so we try to dig in a little deeper and understand the mechanism used by APEX to send web requests to its PL/SQL engine in the database by decoding the APEX page submission. We see the use of the wwv_flow.accept procedure and understand the role of page submission. We try to draw an analogy of an APEX form with a simple HTML to get a thorough understanding about the concept. The next logical thing after page submission is to see the SQL and PL/SQL queries and blocks reaching the database. We turn off the database auditing and see the OWA web toolkit requests flowing to the database as soon as we open an APEX page. We then broaden our vision by quickly knowing about some of the lesser known alternatives of mod_plsql. We end the chapter with a note of caution and try to understand the most valid criticisms of the technology, by understanding the SQL injection and Cross-site Scripting (XSS). After going through the architecture we straight away spring into action and begin the process of learning how to build the reports in APEX. The objective of this chapter is to help you understand and implement the most common reporting requirements along with introducing some interesting ways to frame the analytical queries in Oracle. The chapter also hugely focuses on the methods to implement different kinds of formatting in the APEX classic reports. We start Chapter 2, Reports, by creating the objects that will be used throughout the book and installing the reference application that contains the supporting code for the topics discussed in the second chapter. We start the chapter by setting up an authentication mechanism. We discuss external table authentication in this chapter. We then move on to see the mechanism of capturing the environment variables in APEX. These variables can help us set some logic related to a user’s session and environment. The variables also help us capture some properties of the underlying database session of an APEX session. We capture the variables using the USERENV namespace, DBMS_SESSION package, and owa_util package. After having a good idea of the ways and means to capture the environment variables, we build our understanding of developing the search functionality in a classic APEX report. This is mostly a talk about the APEX classic report features, we also use this opportunity to see the process to enable sorting in the report columns, and to create a link that helps us download the report in the CSV format. We then discuss various ways to implement the group reports in APEX. The discussion shows a way to implement this purely, by using the APEX’s feature, and then talks about getting the similar results by using the Oracle database feature. We talk about the APEX’s internal grouping feature and the Oracle grouping sets. The section also shows the first use of JavaScript to manipulate the report output. It also talks about a method to use the SQL query to create the necessary HTML, which can be used to display a data column in a classic report. We take the formatting discussion further, by talking about a few advanced ways of highlighting the report data using the APEX’s classic report features, and by editing the APEX templates. After trying our hand at formatting, we try to understand the mechanism to implement matrix reports in APEX. We use matrix reports to understand the use of the database features, such as the with clause, the pivot operator, and a number of string aggregation techniques in Oracle. The discussion on the string aggregation techniques includes the talk on the LISTAGG function, the wm_concat function, and the use of the hierarchical queries for this purpose. We also see the first use of the APEX items as the substitution variables in this book. We will see more use of the APEX items as the substitution variables to solve the vexed problems in other parts of the book as well. We do justice with the frontend as well, by talking about the use of jQuery, CSS, and APEX Dynamic Actions for making the important parts of data stand out. Then, we see the implementation of the handlers, such as this.affectedElements and the use of the jQuery functions, such as this.css() in this section. We also see the advanced formatting methods using the APEX templates. We then end this part of the discussion, by creating a matrix report using the APEX Dynamic Query report region. The hierarchical reports are always intriguing, because of the enormous ability to present the relations among different rows of data, and because of the use of the hierarchical queries in a number of unrelated places to answer the business queries. We reserve a discussion on the Oracle’s hierarchical queries for a later part of the section and start it, by understanding the implementation of the hierarchical reports by linking the values in APEX using drilldowns. We see the use of the APEX items as the substitution variable to create the dynamic messages. Since we have devised a mechanism to drill down, we should also build a ladder for the user to climb up the hierarchical chain. Now, the hierarchical chain for every person will be different depending on his position in the organization, so we build a mechanism to build the dynamic bread crumbs using the PL/SQL region in APEX. We then talk about two different methods of implementing the hierarchical queries in APEX. We talk about the connect, by clause first, and then continue our discussion by learning the use of the recursive with clause for the hierarchical reporting. We end our discussion on the hierarchical reporting, by talking about creating the APEX’s Tree region that displays the hierarchical data in the form of a tree. The reports are often associated with the supporting files that give more information about the business query. A typical user might want to upload a bunch of files while executing his piece of the task and the end user of his action might want to check out the files uploaded by him/her. To understand the implementation of this requirement, we check out the various ways to implement uploading and downloading the files in APEX. We start our discussion, by devising the process of uploading the files for the employees listed in the oehr_employees table. The solution of uploading the files involves the implementation of the dynamic action to capture the ID of the employee on whose row the user has clicked. It shows the use of the APEX items as the substitution variables for creating the dynamic labels and the use of JavaScript to feed one APEX item based on the another. We extensively talk about the implementation of jQuery in Dynamic Actions in this section as well. Finally, we check out the use of the APEX’s file browse item along with the WWV_FLOW_FILES table to capture the file uploaded by the user. Discussion on the methods to upload the files is immediately followed by talking about the ways to download these files in APEX. We nest the use of the functions, such as HTF.ANCHOR and APEX_UTIL.GET_BLOB_FILE_SRC for one of the ways to download a file, and also talk about the use of dbms_lob.getlength along with the APEX format mask for downloading the files. We then engineer our own stored procedure that can download a blob stored in the database as a file. We end this discussion, by having a look at the APEX’s p process, which can also be used for downloading. AJAX is the mantra of the new age and we try our hand at it, by implementing the soft deletion in APEX. We see the mechanism of refreshing just the report and not reloading the page as soon as the user clicks to delete a row. We use JavaScript along with the APEX page process and the APEX templates to achieve this objective. Slicing and dicing along with auditing are two of the most common requirements in the reporting world. We see the implementation of both of these using both the traditional method JavaScript with the page processes and the new method of using Dynamic Actions. We extend our use case a little further and learn about a two way interaction between the JavaScript function and the page process. We learn to pass the values back and forth between the two. While most business intelligence and reporting solutions are a one way road and are focused on the data presentation, Oracle APEX can go a step further, and can give an interface to the user for the data manipulations as well. To understand this strength of APEX, we have a look at the process of creating the tabular forms in APEX. We extend our understanding of the tabular forms to see a magical use of jQuery to convert a certain sections of a report from display-only to editable textboxes. We move our focus from implementing the interesting and tricky frontend requirements to framing the queries to display the complex data types. We use the string aggregation methods to display data in a column containing a varray. Time dimension is one of the most widely used dimension in reporting circles and comparing current performance with the past records is a favorite requirement of most businesses. With this in mind, we shift our focus to understand and implement the time series reports in APEX. We start our discussion by understanding the method to implement a report that shows the contribution of each business line in every quarter. We implement this by using partitioning dimensions on the fly. We also use the analytical functions, such as ratio_to_report, lead, and lag in the process of creating the time series reports. We use our understanding of time dimension to build a report that helps a user compare one time period to the other. The report gives the user the freedom to select the time segments, which he wishes to compare. We then outwit a limitation of this report, by using the query partition clause for the data densification. We bring our discussion on reports based on time dimension, by presenting a report based on the modal clause to you. The report serves as an example to show the enormous possibilities to code, by using the modal clause in Oracle. We bring our discussion on reports based on time dimension, by presenting a report based on the modal clause to you. The report serves as an example to show the enormous possibilities to code, by using the modal clause in Oracle. Chapter 3, In the APEX Mansion – Interactive Reports is all about Interactive Reports and the dynamic reporting. While the second chapter was more about the data presentation using the complex queries and the presentation methods, this chapter is about taking the presentation part a step ahead, by creating more visually appealing Interactive Reports. We start the discussion of this chapter, by talking about the ins and outs of Interactive Reports. We postmortem this feature of APEX to learn about every possible way of using Interactive Reports for making more sense of data. The chapter has a reference application of its own, which shows the code in action. We start our discussion, by exploring at various features of the Actions menu in Interactive Reports. The discussion is on search functionality, Select Columns feature, filtering, linking and filtering Interactive Reports using URLs, customizing Rows per page feature of an IR, using Control Break, creating computations in IR, creating charts in IR, using the Flashback feature and a method to see the back end flashback query, configuring the e-mail functionality for downloading a report, and for subscription of reports and methods to understand the download of reports in HTML, CSV, and PDF formats. Once we are through with understanding the Actions menu, we move on to understand the various configuration options in an IR. We talk about the Link section, the Icon View section, the Detail section, the Column Group section, and the Advanced section of the Report Attributes page of an IR. While discussing these sections, we understand the process of setting different default views of an IR for different users. Once we are through with our dissection of an IR, we put our knowledge to action, by inserting our own item in the Actions menu of an IR using Dynamic Actions and jQuery. We continue our quest for finding the newer formatting methods, by using different combinations of SQL, CSS, APEX templates, and jQuery to achieve unfathomable results. The objectives attained in this section include formatting a column of an IR based on another column, using CSS in the page header to format the APEX data, changing the font color of the alternate rows in APEX, using a user-defined CSS class in APEX, conditionally highlighting a column in IR using CSS and jQuery, and formatting an IR using the region query. After going through a number of examples on the use of CSS and jQuery in APEX, we lay down a process to use any kind of changes in an IR. We present this process by an example that changes one of the icons used in an IR to a different icon. APEX also has a number of views for IR, which can be used for intelligent programming. We talk about an example that uses the apex_application_page_ir_rpt view that show different IR reports on the user request. After a series of discussion on Interactive Reports, we move on to find a solution to an incurable problem of IR. We see a method to put multiple IR on the same page and link them as the master-child reports. We had seen an authentication mechanism (external table authentication) and data level authorization in Chapter 2, Reports. We use this chapter to see the object level authorization in APEX. We see the method to give different kinds of rights on the data of a column in a report to different user groups. After solving a number of problems and learning a number of things, we create a visual treat for ourselves. We create an Interactive report dashboard using Dynamic Actions. This dashboard presents different views of an IR as gadgets on a separate APEX page. We conclude this chapter, by looking at advanced ways of creating the dynamic reports in APEX. We look at the use of the table function in both native and interface approach, and we also look at the method to use the APEX collections for creating the dynamic IRs in APEX. Chapter 4, The Fairy Tale Begins – Advanced Reporting, is all about advanced reporting and pretty graphs. Since we are talking about advanced reporting, we see the process of setting the LDAP authentication in APEX. We also see the use of JXplorer to help us get the necessary DN for setting up the LDAP authentication. We also see the means to authenticate an LDAP user using PL/SQL in this section. This chapter also has a reference application of its own that shows the code in action. We start the reporting building in this chapter, by creating the Sparkline reports. This report uses jQuery for producing the charts. We then move on to use another jQuery library to create a report with slider. This report lets the user set the value of the salary of any employee using a slider. We then get into the world of the HTML charts. We start our talk, by looking at the various features of creating the HTML chart regions in APEX. We understand a method to implement the Top N and Bottom N chart reports in the HTML charts. We understand the APEX’s method to implement the HTML charts and use it to create an HTML chart on our own. We extend this technique of generating HTML from region source a little further, by using XMLTYPE to create the necessary HTML for displaying a report. We take our spaceship into a different galaxy of the charts world and see the use of Google Visualizations for creating the charts in APEX. We then switch back to AnyChart, which is a flash charting solution, and has been tightly integrated with APEX. It works on an XML and we talk about customizing this XML to produce different results. We put our knowledge to action, by creating a logarithmic chart and changing the style of a few display labels in APEX. We continue our discussion on AnyChart, and use the example of Doughnut Chart to understand advanced ways of using AnyChart in APEX. We use our knowledge to create Scatter chart, 3D stacked chart, Gauge chart, Gantt chart, Candlestick chart, Flash image maps, and SQL calendars. We move out of the pretty heaven of AnyChart only to get into another beautiful space of understanding the methods of displaying the reports with the images in APEX. We implement the reports with the images using the APEX’s format masks and also using HTF.IMG with the APEX_UTIL.GET_BLOB_FILE_SRC function. We then divert our attention to advanced jQuery uses such as, the creation of the Dialog box and the Context menu in APEX. We create a master-detail report using dialog boxes, where the child report is shown in the Dialog box. We close our discussion in this chapter, by talking about creating wizards in APEX and a method to show different kinds of customized error messages for problems appearing at different points of a page process. Chapter 5, Flight to Space Station: Advanced APEX, is all about advanced APEX. The topics discussed in this chapter fall into a niche category of reporting the implementation. This chapter has two reference applications of its own that shows the code in action. We start this chapter, by creating both the client-side and server-side image maps APEX. These maps are often used where regular shapes are involved. We then see the process of creating the PL/SQL Server Pages (PSPs). PSPs are similar to JSP and are used for putting the PL/SQL and HTML code in a single file. Loadpsp then converts this file into a stored procedure with the necessary calls to owa web toolkit functions. The next utility we check out is loadjava. This utility helps us load a Java class as a database object. This utility will be helpful, if some of the Java classes are required for code processing. We had seen the use of AnyChart in the previous chapter, and this chapter introduces you to FusionChart. We create a Funnel Chart using FusionChart in this chapter. We also use our knowledge of the PL/SQL region in APEX to create a Tag Cloud. We then stroll into the world of the APEX plugins. We understand the interface functions, which have to be created for plugins. We discuss the concepts and then use them to develop an Item type plugin and a Dynamic Action plugin. We understand the process of defining a Custom Attribute in a plugin, and then see the process of using it in APEX. We move on to learn about Websheets and their various features. We acquaint ourselves with the interface of the Websheet applications and understand the concept of sharing Websheets. We also create a few reports in a Websheet application and see the process of importing and using an image in Websheets. We then spend some time to learn about data grids in Websheets and the method to create them. We have a look at the Administration and View dropdowns in a Websheet application. We get into the administration mode and understand the process of configuring the sending of mails to Gmail server from APEX. We extend our administration skills further, by understanding the ways to download an APEX application using utilities, such as oracle.apex.APEXExport. Reporting and OLAP go hand in hand, so we see the method of using the OLAP cubes in APEX. We see the process of modeling a cube and understand the mechanism to use its powerful features. We then have a talk about Oracle Advanced Queues that can enable us to do reliable communication among different systems with different workloads in the enterprise, and gives improved performance. We spend a brief time to understand some of the other features of APEX, which might not be directly related to reporting, but are good to know. Some of these features include locking and unlocking of pages in APEX, the Database Object Dependencies report, Shortcuts, the Dataloading wizard, and the APEX views. We bring this exclusive chapter to an end, by discussing about the various packages that enable us to schedule the background jobs in APEX, and by discussing various other APEX and Database API, which can help us in the development process.

0
0
4216

Packt

03 Sep 2013

22 min read

Working with Time

Packt

03 Sep 2013

22 min read

0
0
3411

Packt

02 Sep 2013

10 min read

Memory and cache

Packt

02 Sep 2013

10 min read

(For more resources related to this topic, see here.) You can find this instruction in the OGlobalConfiguration.java file in the autoConfig() method. Furthermore, you can enable/disable level 1 cache, level 2 cache, or both. You can also set the number of records that will be stored in each level as follows: cache.level1.size: This sets the number of records to be stored in the level 1 caches (default -1, no limit) cache.level2.size: This sets the number of records to be stored in the level 2 cache (default -1, no limit) cache.level1.enabled: This is a boolean value, it enables/disables the level 1 cache (default, true) cache.level2.enabled: This is a boolean value, it enables/disables the level 2 cache (default, true) Mapping files OrientDB uses NIO to map data files in memory. However, you can change the way this mapping is performed. This is achieved by modifying the file access strategy. Mode 0: It uses the memory mapping for all the operations. Mode 1 (default): It uses the memory mapping, but new reads are performed only if there is enough memory, otherwise the regular Java NIO file read/write is used. Mode 2: It uses the memory mapping only if the data has been previously loaded. Mode 3: It uses memory mapping until there is space available, then use regular JAVA NIO file read/write. Mode 4: It disables all the memory mapping techniques. To set the strategy mode, you must use the file.mmap.strategy configuration property. Connections When you have to connect with a remote database you have some options to improve your application performance. You can use the connection pools, and define the timeout value to acquire a new connection. The pool has two attributes: minPool: It is the minimum number of opened connections maxPool: It is the maximum number of opened connections When the first connection is requested to the pool, a number of connections corresponding to the minPool attribute are opened against the server. If a thread requires a new connection, the requests are satisfied by using a connection from the pool. If all the connections are busy, a new one is created until the value of maxPool is reached. Then the thread will wait, so that a connection is freed. Minimum and maximum connections are defined by using the client.channel.minPool (default value 1) and client.channel.maxPool (default value 5) properties. However, you can override these values in the client code by using the setProperty() method of the connection class. For example: database = new ODatabaseDocumentTx("remote:localhost/demo");database.setProperty("minPool", 10);database.setProperty("maxPool", 50);database.open("admin", "admin"); You can also change the connection timeout values. In fact, you may experience some problem, if there are network latencies or if some server-side operations require more time to be performed. Generally these kinds of problems are shown in the logfile with warnings: WARNING: Connection re-acquired transparently after XXXms and Yretries: no errors will be thrown at application level You can try to change the network.lockTimeout and the network.socketTimeout values. The first one indicates the timeout in milliseconds to acquire a lock against a channel (default is 15000), the second one indicates the TCP/IP socket timeout in milliseconds (default is 10000). There are some other properties you can try to modify to resolve network issues. These are as follows: network.socketBufferSize: This is the TCP/IP socket buffer size in bytes (default 32 KB) network.retry: This indicates the number of retries a client should do to establish a connection against a server (default is 5) network.retryDelay: This indicates the number of milliseconds a client will wait before retrying to establish a new connection (default is 500) Transactions If your primary objective is the performance, avoid using transactions. However, if it is very important for you to have transactions to group operations, you can increase overall performance by disabling the transaction log. To do so just set the tx.useLog property to false. If you disable the transaction log, OrientDB cannot rollback operations in case JVM crashes. Other transaction parameters are as follows: tx.log.synch: It is a Boolean value. If set, OrientDB executes a synch against the filesystem for each transaction log entry. This slows down the transactions, but provides reliability on non- reliable devices. Default value is false. tx.commit.synch: It is a Boolean value. If set, it performs a storage synch after a commit. Default value is true. Massive insertions If you want to do a massive insertion, there are some tricks to speed up the operation. First of all, do it via Java API. This is the fastest way to communicate with OrientDB. Second, instruct the server about your intention: db.declareIntent( new OIntentMassiveInsert() );//your code here....db.declareIntent( null ); Here db is an opened database connection. By declaring the OIntentMassiveInsert() intent, you are instructing OrientDB to reconfigure itself (that is, it applies a set of preconfigured configuration values) because a massive insert operation will begin. During the massive insert, avoid creating a new ODocument instance for each record to insert. On the contrary, just create an instance the first time, and then clean it using the reset() method: ODocument doc = new ODocument();for(int i=0; i< 9999999; i++){doc.reset(); //here you will reset the ODocument instancedoc.setClassName("Author");doc.field("id", i);doc.field("name", "John");doc.save();} This trick works only in a non-transactional context. Finally, avoid transactions if you can. If you are using a graph database and you have to perform a massive insertion of vertices, you can still reset just one vertex: ODocument doc = db.createVertex();...doc.reset();... Moreover, since a graph database caches the most used elements, you may disable this: db.setRetainObjects(false); Datafile fragmentation Each time a record is updated or deleted, a hole is created in the datafiles structure. OrientDB tracks these holes and tries to reuse them. However, many updates and deletes can cause a fragmentation of datafiles, just like in a filesystem. To limit this problem, it is suggested to set the oversize attribute of the classes you create. The oversize attribute is used to allocate more space for records once they are created, so as to avoid defragmentation upon updates. The oversize attribute is a multiplying factor where 1.0 or 0 means no oversize. The default values are 0 for document, and 2 for vertices. OrientDB has a defrag algorithm that starts automatically when certain conditions are verified. You can set some of these conditions by using the following configuration parameter: file.defrag.holeMaxDistance: It defines the maximum distance in bytes between two holes that triggers the defrag procedure. The default is 32 KB, -1 means dynamic size. The dynamic size is computed in the ODataLocal class in the getCloserHole() method, as Math.max(32768 * (int) (size / 10000000), 32768), where size is the current size of the file. The profiler OrientDB has an embedded profiler that you can use to analyze the behavior of the server. The configuration parameters that act on the profiler are as follows: profiler.enabled: This is a boolean value (enable/disable the profiler), the default value is false. profiler.autoDump.interval: It is the number of seconds between profiler dump. The default value is 0, which means no dump. profiler.autoDump.reset: This is a boolean value, reset the profile at every dump. The default is true. The dump is a JSON string structured in sections. The first one is a huge collection of information gathered at runtime related to the configuration and resources used by each object in the database. The keys are structured as follows: db.<db-name>: They are database-related metrics db.<db-name>.cache: They are metrics about databases' caching db.<db-name>.data: They are metrics about databases' datafiles, mainly data holes db.<db-name>.index: They are metrics about databases' indexes system.disk: They are filesystem-related metrics system.memory: They are RAM-related metrics system.config.cpus: They are the number of the cores process.network: They are network metrics process.runtime: They provide process runtime information and metrics server.connections.actives: They are number of active connections The second part of the dump is a collection of chronos. A chrono is a log of an operation, for example, a create operation, an update operation, and so on. Each chrono has the following attributes: last: It is the last time recorded min: It is the minimum time recorded max: It is the maximum time recorded average: It is the average time recorded total: It is the total time recorded entries: It is the number of times the specific metric has been recorded Finally, there are sections about many counters. Query tips In the following paragraphs some useful information on how to optimize the queries execution is given. The explain command You can see how OrientDB accesses the data by using the explain command in the console. To use this command simply write explain followed by the select statement: orientdb> explain select from Posts A set of key-value pairs are returned. Keys mean the following: resultType: It is the type of the returned resultset. It can be collection, document, or number. resultSize: It is the number of records retrieved if the resultType is collection. recordReads: It is the number of records read from datafiles. involvedIndexes: They are the indices involved in the query. indexReads: It is the number of records read from the indices. documentReads: They are the documents read from the datafiles. This number could be different from recordReads, because in a scanned cluster there can be different kinds of records. documentAnalyzedCompatibleClass: They are the documents analyzed belonging to the class requested by the query. This number could be different from documentReads, because a cluster may contain several different classes. elapsed: This time is measured in nanoseconds, it is the time elapsed to execute the statement. As you can see, OrientDB can use indices to speed up the reads. Indexes You can define indexes as we do in a relational database using the create index statement or via Java API using the createIndex() method of the OClass class: create index <class>.<property> [unique|notunique|fulltext] [field type] Or for composite index (an index on more than one property): create index <index_name> on <class> (<field1>,<field2>)[unique|notunique|fulltext] If you create a composite index, OrientDB will use it also when in a where clause you don't specify a criteria against all the indexed fields. So you can avoid this to build an index for each field you use in the queries if you have already built a composite one. This is the case of a partial match search and further information about it can be found in the OrientDB wiki at https://github.com/nuvolabase/orientdb/wiki/Indexes#partial-match-search. Generally, the indexes don't work with the like operator. If you want to perform the following query: select from Authors where name like 'j%' And you want use an index, you must define on the field name a FULLTEXT index. FULLTEXT indices permit to index string fields. However keep in mind that indices slow down the insert, update, and delete operations. Summary In this article we have seen some strategies that try to optimize both the OrientDB server installation and queries. Resources for Article: Further resources on this subject: Comparative Study of NoSQL Products [Article] Connecting to Microsoft SQL Server Compact 3.5 with Visual Studio [Article] Microsoft SQL Azure Tools [Article]

0
0
1973

Packt

02 Sep 2013

6 min read

OAuth Authentication

Packt

02 Sep 2013

6 min read

(For more resources related to this topic, see here.) Understanding OAuth OAuth has the concept of Providers and Clients. An OAuth Provider is like a SAML Identity Provider, and is the place where the user enters their authentication credentials. Typical OAuth Providers include Facebook and Google. OAuth Clients are resources that want to protect resources, such as a SAML Service Provider. If you have ever been to a site that has asked you to log in using your Twitter or LinkedIn credentials then odds are that site was using OAuth. The advantage of OAuth is that a user’s authentication credentials (username and password, for instance) is never passed to the OAuth Client, just a range of tokens that the Client requested from the Provider and which are authorized by the user. OpenAM can act as both an OAuth Provider and an OAuth Client. This chapter will focus on using OpenAM as an OAuth Client and using Facebook as an OAuth Provider. Preparing Facebook as an OAuth Provider Head to https://developers.facebook.com/apps/ and create a Facebook App. Once this is created, your Facebook App will have an App ID and an App Secret. We’ll use these later on when configuring OpenAM. Facebook won’t let a redirect to a URL (such as our OpenAM installation) without being aware of the URL. The steps for preparing Facebook as an OAuth provider are as follows: Under the settings for the App in the section Website with Facebook Login we need to add a Site URL. This is a special OpenAM OAuth Proxy URL, which for me was http://openam.kenning.co.nz:8080/openam/oauth2c/OAuthProxy.jsp as shown in the following screenshot: Click on the Save Changes button on Facebook. My OpenAM installation for this chapter was directly available on the Internet just in case Facebook checked for a valid URL destination. Configuring an OAuth authentication module OpenAM has the concept of authentication modules, which support different ways of authentication, such as OAuth, or against its Data Store, or LDAP or a Web Service. We need to create a new Module Instance for our Facebook OAuth Client. Log in to OpenAM console. Click on the Access Control tab, and click on the link to the realm / (Top Level Realm). Click on the Authentication tab and scroll down to the Module Instances section. Click on the New button. Enter a name for the New Module Instance and select OAuth 2.0 as the Type and click on the OK button. I used the name Facebook. You will then see a screen as shown: For Client Id, use the App ID value provided from Facebook. For the Client Secret use the App Secret value provided from Facebook as shown in the preceding screenshot. Since we’re using Facebook as our OAuth Provider, we can leave the Authentication Endpoint URL, Access Token Endpoint URL, and User Profile Service URL values as their default values. Scope defines the permissions we’re requesting from the OAuth Provider on behalf of the user. These values will be provided by the OAuth Provider, but we’ll use the default values of email and read_stream as shown in the preceding screenshot. Proxy URL is the URL we copied to Facebook as the Site URL. This needs to be replaced with your OpenAM installation value. The Account Mapper Configuration allows you to map values from your OAuth Provider to values that OpenAM recognizes. For instance, Facebook calls emails email while OpenAM references values from the directory it is connected to, such as mail in the case of the embedded LDAP server. This goes the same for the Attribute Mapper Configuration. We’ll leave all these sections as their defaults as shown in the preceding screenshot. OpenAM allows attributes passed from the OAuth Provider to be saved to the OpenAM session. We’ll make sure this option is Enabled as shown in the preceding screenshot. When a user authenticates against an OAuth Provider, they are likely to not already have an account with OpenAM. If they do not have a valid OpenAM account then they will not be allowed access to resources protected by OpenAM. We should make sure that the option to Create account if it does not exist is Enabled as shown in the preceding screenshot. Forcing authentication against particular authentication modules In the writing of this book I disabled the Create account if it does not exist option while I was testing. Then when I tried to log into OpenAM I was redirected to Facebook, which then passed my credentials to OpenAM. Since there was no valid OpenAM account that matched my Facebook credentials I could not log in. For your own testing, it would be recommended to use http://openam.kenning.co.nz:8080/openam/UI/Login?module=Facebook rather than changing your authentication chain. Thankfully, you can force a login using a particular authentication module by adjusting the login URL. By using http://openam.kenning.co.nz:8080/openam/UI/Login?module=DataStore, I was able to use the Data Store rather than OAuth authentication module, and log in successfully. For our newly created accounts we can choose to prompt the user to create a password and enter an activation code. For our prototype we’ll leave this option as Disabled. The flip side to Single Sign On is Single Log Out. Your OAuth Provider should provide a logout URL which we could possibly call to log out a user when they log out of OpenAM. The options we have when a user logs out of OpenAM is to either not log them out of the OAuth Provider, to log them out of the OAuth Provider, or to ask the user. If we had set earlier that we wanted to enforce password and activation token policies, then we would need to enter details of an SMTP server, which would be used to email the activation token to the user. For the purposes of our prototype we’ll leave all these options blank. Click on the Save button. Summary This article served as a quick primer on what OAuth is and how to achieve it with OpenAM. It covered the concept of using Facebook as an OAuth provider and configuring an OAuth module. It focused on using OpenAM as an OAuth Client and using Facebook as an OAuth Provider. This would really help when we might want to allow authentication against Facebook or Google. Resources for Article: Further resources on this subject: Getting Started with OpenSSO [Article] OpenAM: Oracle DSEE and Multiple Data Stores [Article] OpenAM Identity Stores: Types, Supported Types, Caching and Notification [Article]

0
0
41186

article-image-working-import-process-intermediate

Packt

30 Aug 2013

5 min read

Working with Import Process (Intermediate)

Packt

30 Aug 2013

5 min read

(For more resources related to this topic, see here.) Getting ready The first thing we need to do is to download the latest version of Sqoop from following location http://www.apache.org/dist/sqoop/ and extract it on your machine. Now I am calling the Sqoop installation dir as $SQOOP_HOME. Given here are the prerequisites for Sqoop import process. Installed and running Relational Database Management System (MySQL). Installed and running Hadoop Cluster. Set $HADOOP_HOME environment variable. Following are the common arguments of import process. Parameters Description --connect <jdbc-uri> This command specifies the server or database to connect. It also specifies the port. Example: --connect jdbc:mysql://host:port/databaseName --connection-manager <class-name> Specify connection manager class name. --driver <class-name> Specify the fully qualified name of JDBC driver class. --password <password> Set authentication password required to connect to input source. --username <username> Set authentication username. How to do it Let’s see how to work with import process First, we will start with import single RDBMS table into Hadoop. Query1: $ bin/sqoop import --connect jdbc:mysql://localhost:3306/db1 --username root --password password --table tableName --target-dir /user/abc/tableName The content of output file in HDFS will look like: Next, we will put some light on approach of import only selected rows and selected columns of RDBMS table into Hadoop. Query 2: Import selected columns $ bin/sqoop import --connect jdbc:mysql://localhost:3306/db1 --username root --password password --table student --target-dir /user/abc/student --columns “student_id,address,name” Query 3: Import selected rows. $ bin/sqoop import --connect jdbc:mysql://localhost:3306/db1 --username root --password password --table student --target-dir /user/abc/student --where ‘student_id<100’ Query 4: Import selected columns of selected rows. $ bin/sqoop import --connect jdbc:mysql://localhost:3306/db1 --username root --password password --table student --target-dir /user/abc/student --columns “student_id,address,name” -- where ‘student_id<100’ How it works… Now let’s see how the above steps work: Import single table: Apart from the common arguments of import process, as explained previously, this part covers some other arguments which are required to import a table into Hadoop Distributed File System. Parameters Description --table <table-name> Name of input table to fetch. --target-dir<dir> Location of output/target dir in HDFS. --direct If user want to use non-JDBC based access mechanism for faster database access --options-file <file-path> All the command line options that are common in most of commands can put in options file for convenience. The Query1 will run a MapReduce job and import all the rows of given table to HDFS (where, /user/abc/tableName is the location of output files). The records imported in HDFS preserve their original columns order, which means, if input table contains four columns A, B, C and D, then content in HDFS file will look like: A1, B1, C1, D1 A2, B2, C2, D2 Import selected columns: By default, the import query will select all columns of input table for import, but we can select the subset of columns by specifying the comma separated list of columns in --columns argument. The Query2 will only fetch three columns (student_id, address and name) of student table. If import query contains the --columns argument, then the order of column in output files are same as order specified in --columns argument. The output in HDFS will look like: student_id, address, name 1, Delhi, XYZ 2, Mumbai, PQR .......... If the input query contains the column in following order -- “address, name, student_id”, then the output in HDFS will look like. address, name, student_id Delhi, XYZ, 1 Mumbai, PQR, 2 ............. Import selected rows: By default, all the rows of input table will be imported to HDFS, but we can control which rows need to be import by using a --where argument in the import statement. The Query3 will import only those rows into HDFS which has value of “student_id” column greater than 100. The Query4 use both --columns and --where arguments in one statement. For Query4, Sqoop will internally generates the query of the form “select student_id, address, name from student where student_id<100”. There’s more... This section covers some more examples of import process. Import all tables: So far we have imported a single table into HDFS, this section introduces an import-all-tables tool, by which we can import a set of tables from an RDBMS to HDFS. The import-all-tables tool creates a separate directory in HDFS for each RDBMS table. The following are the mandatory conditions for import-all-tables tool: All tables must have a single primary key column. User must intend to import all the columns of each table. No --where, --columns and --query arguments are permitted. Example: Query 5: $ bin/sqoop import-all-tables --connect jdbc:mysql://localhost:3306/db1 --username root --password password This query will import all tables (tableName and tableName1) of database db1 into HDFS. Output directories in HDFS look like: Summary We learned a lot in this article, about import single RDBMS table into HDFS, import selected columns and selected rows, and import set of RDBMS tables. Resources for Article : Further resources on this subject: Introduction to Logging in Tomcat 7 [Article] Configuring Apache and Nginx [Article] Geronimo Architecture: Part 2 [Article]

0
0
1247

Packt

30 Aug 2013

13 min read

Managing a Hadoop Cluster

Packt

30 Aug 2013

13 min read

0
0
16010

article-image-what-are-ssas-2012-dimensions-and-cube

Packt

29 Aug 2013

7 min read

What are SSAS 2012 dimensions and cube?

Packt

29 Aug 2013

7 min read

(For more resources related to this topic, see here.) What is SSAS? SQL Server Analysis Services is an online analytical processing tool that highly boosts the different types of SQL queries and calculations that are accepted in the business intelligence environment. It looks like a relation database, but it has differences. SSAS does not replace the requirement of relational databases, but if you combine the two, it would help to develop the business intelligence solutions. Why do we need SSAS? SSAS provide a very clear graphical interface for the end users to build queries. It is a kind of cache that we can use to speed up reporting. In most real scenarios where SSAS is used, there is a full copy of the data in the data warehouse. All reporting and analytic queries are run against SSAS rather than against the relational database. Today's modern relational databases include many features specifically aimed at BI reporting. SSAS are database services specifically designed for this type of workload, and in most cases it has achieved much better query performance. SSAS 2012 architecture In this article we will explain about the architecture of SSAS. The first and most important point to make about SSAS 2012 is that it is really two products in one package. It has had a few advancements relating to performance, scalability, and manageability. This new version of SSAS that closely resembles PowerPivot uses the tabular model. When installing SSAS, we must select either the tabular model or multidimensional model for installing an instance that runs inside the server; both data models are developed under the same code but sometimes both are treated separately. The concepts included in designing both data models are different, and we can't turn a tabular database into a multidimensional database, or vice versa without rebuilding everything from the start. The main point of view of the end users is that both data models do almost the same things and appear almost equally when used through a client tool such as Excel. The tabular model A concept of building a database using the tabular model is very similar to building it in a relational database. An instance of Analysis Services can hold many databases, and each database can be looked upon as a self-contained collection of objects and data relating to a single business solution. If we are writing reports or analyzing data and we find that we need to run queries on multiple databases, we probably have made a design mistake somewhere because everything we need should be contained within an individual database. Tabular models are designed by using SQL Server Data Tools (SSDT), and a data project in SSDT mapping onto a database in Analysis Services. The multidimensional model This data model is very similar to the tabular model. Data is managed in databases, and databases are designed in SSDT, which are in turn managed by using SQL Server Management Studio. The differences may become similar below the database level, where the multidimensional data model rather than relational concepts are accepted. In the multidimensional model, data is modeled as a series of cubes and dimensions and not tables. The future of Analysis Services We have two data models inside SSAS, along with two query and calculation languages; it is clearly not an ideal state of affairs. It means we have to select a data model to use at the start of our project, when we might not even know enough about our need to gauge which one is appropriate. It also means that anyone who decides to specialize in SSAS has to learn two technologies. Microsoft has very clearly said that the multidimensional model is not scrapped and that the tabular model is not its replacement. It is just like saying that the new advanced features for the multidimensional data model will be released in future versions of SSAS. The fact that the tabular and multidimensional data models share some of the same code suggests that some new features could easily be developed for both models simultaneously. What's new in SSAS 2012? As we know, there is no easy way of transferring a multidimensional data model into a tabular data model. We may have many tools in the market that claim to make this transition with a few mouse clicks, but such tools could only ever work for very simple multidimensional data models and would not save much development time. Therefore, if we already have a mature multidimensional implementation and the in-house skills to develop and maintain it, we may find the following improvements in SSAS 2012 useful. Ease of use If we are starting an SSAS 2012 project with no previous multidimensional or OLAP experience, it is very likely that we will find a tabular model much easier to learn than a multidimensional one. Not only are the concepts much easier to understand, especially if we are used to working with relational databases, but also the development process is much more straightforward and there are far fewer features to learn. Compatibility with PowerPivot The tabular data model and PowerPivot are the same in the way their models are designed. The user interfaces used are practically the same, as both the interfaces use DAX. PowerPivot models can be imported into SQL Server Data Tools to generate a tabular model, although the process does not work the other way around, and a tabular model cannot be converted to a PowerPivot model. Processing performance characteristics If we compare the processing performance of the multidimensional and tabular data models, that will become difficult. It may be slower to process a large table following the tabular data model than the equivalent measure group in a multidimensional one because a tabular data model can't process partitions in the same table at the same time, whereas a multidimensional model can process partitions in the same measure group at the same time. What is SSAS dimension? A database dimension is a collection of related objects; in other words, attributes; they provide the information about fact data in one or more cubes. Typical attributes in a product dimension are product name, product category, line, size, and price. Attributes can be organized into user-defined hierarchies that provide the paths to assist users when they browse through the data in a cube. By default these attributes are visible as attribute hierarchies, and they can be used to understand fact data in a cube. What is SSAS cube? A cube is a multidimensional structure that contains information for analytical purposes; the main constituents of a cube are dimensions and measures. Dimensions define the structure of a cube that you use to slice and dice over, and measures provide the aggregated numerical values of interest to the end user. As a logical structure, a cube allows a client application to retrieve values—of measures—as if they are contained in cells in the cube. The cells are defined for every possible summarized value. A cell, in the cube, is defined by the intersection of dimension members and contains the aggregated values of the measures at that specific intersection. Summary We talked about the special new features and services present, what you can do with them, and why they’re so great. Resources for Article: Further resources on this subject: Creating an Analysis Services Cube with Visual Studio 2008 - Part 1 [Article] Performing Common MDX-related Tasks [Article] How to Perform Iteration on Sets in MDX [Article]

0
0
4453

How-To Tutorials - Data

Recursive queries

Regression models in Weka

Out-of-process distributed caching

Using indexes to manipulate pandas objects

Quick start – writing your first MDX query

Understanding Point-In-Time-Recovery

Integrating Storm and Hadoop

SQL Server Integration Services (SSIS)

Oracle APEX 4.2 reporting

Working with Time

Trending Topics

Memory and cache

OAuth Authentication

Working with Import Process (Intermediate)

Managing a Hadoop Cluster

What are SSAS 2012 dimensions and cube?

Create a Free Account To Continue Reading

Sign in to activate your 7-day free access