
How-To Tutorials - Data

1210 Articles

Time Dimension in Oracle Warehouse Builder 11g

Packt
05 Oct 2009
4 min read
Let's discuss briefly what a Time dimension is, and then we'll dive right into the Warehouse Builder Design Center and create one. A Time dimension is a key part of most data warehouses. It provides the time series information that describes our data, and a key feature of a data warehouse is the ability to analyze data from several time periods and compare the results. The Time dimension is what gives us the means to retrieve data by time period.

Do not be confused by the use of the word Time for this dimension. It does not refer to the time of day but to time in general, which can span days, weeks, months, and so on. We use it because the Warehouse Builder uses the word Time for this type of dimension to signify a time period. So when we refer to a Time dimension here, we are talking about the time-period dimension that we will use to store the date; we will name it Date to be clear about what information it contains.

Every dimension, whether a Time dimension or not, has four characteristics that have to be defined in OWB:

- Levels
- Dimension Attributes
- Level Attributes
- Hierarchies

The Levels define the levels at which aggregations will occur, or to which data can be summed. We must have at least two levels in our Time dimension. While reporting on data from our data warehouse, users will want to see totals summed up by certain time periods, such as per day, per month, or per year; these become the levels. A multidimensional implementation includes metadata to enable aggregations automatically at those levels if we use the OLAP feature, while a relational implementation can make use of those levels in queries to sum the data. The Warehouse Builder has the following Levels available for the Time dimension:

- Day
- Fiscal week
- Calendar week
- Fiscal month
- Calendar month
- Fiscal quarter
- Calendar quarter
- Fiscal year
- Calendar year

The Dimension Attributes are the individual pieces of information we're going to store in the dimension that can be found at more than one level. Each level will have an ID that identifies the level, a start and an end date for the time period represented at that level, a time span that indicates the number of days in the period, and a description of the level.

Each level also has Level Attributes associated with it that provide descriptive information about the value in that level. These include the dimension attributes found at that level plus additional attributes specific to the level. For example, at the Month level we will find attributes that describe the value for the month, such as the month of the year it represents, or the month in the calendar quarter; these would be numbers indicating which month of the year or which month of the quarter it is. The Oracle Warehouse Builder Users' Guide contains a more complete list of all the attributes that are available. OWB tracks which of these attributes are applicable to which level and allows a separate description to be set that identifies the attribute for that level. Toward the end of the chapter, when we look at the Data Object Editor, we'll see the feature the Warehouse Builder provides to view details about objects such as dimensions and cubes.

We must also define at least one Hierarchy for our Time dimension. A hierarchy is a structure in our dimension composed of certain levels in order; there can be one or more hierarchies in a dimension. Calendar month, calendar quarter, and calendar year, for example, can form a hierarchy.
We could view our data at each of these levels, and the next level up would simply be a summation of all the lower-level data within that period. A calendar quarter sum would be the sum of all the values in the calendar month level in that quarter, and the multidimensional implementation includes the metadata to facilitate these kinds of calculations. This is one of the strengths of a multidimensional implementation. The good news is that the Warehouse Builder contains a wizard that will do all the work for us—create our Time dimension and define the above four characteristics—just by asking us a few questions.
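To make the levels and hierarchy a little more concrete, here is a minimal relational sketch of what a Date (Time) dimension table and a rollup query over it might look like. The table and column names (date_dim, sales_fact, and so on) are hypothetical illustrations only, not the exact objects or DDL that the Warehouse Builder wizard generates.

-- Hypothetical relational shape of a Date dimension: one row per day,
-- with columns for the higher levels of the
-- Day > Calendar month > Calendar quarter > Calendar year hierarchy.
CREATE TABLE date_dim (
  dimension_key      NUMBER PRIMARY KEY,  -- surrogate ID filled from a sequence
  day_code           NUMBER,              -- e.g. 20090105
  day_start_date     DATE,
  day_end_date       DATE,
  day_time_span      NUMBER,              -- number of days in the period (1 for Day)
  day_description    VARCHAR2(100),
  cal_month_code     NUMBER,              -- e.g. 200901
  cal_month_of_year  NUMBER,              -- level attribute: 1..12
  cal_quarter_code   NUMBER,              -- e.g. 20091
  cal_year_code      NUMBER               -- e.g. 2009
);

-- Summing a measure up one level of the hierarchy: a calendar-quarter
-- total is just the sum of the daily values that fall in that quarter.
SELECT d.cal_year_code, d.cal_quarter_code, SUM(s.sale_amount) AS quarter_sales
FROM   sales_fact s
JOIN   date_dim   d ON d.dimension_key = s.date_dim_key
GROUP  BY d.cal_year_code, d.cal_quarter_code
ORDER  BY d.cal_year_code, d.cal_quarter_code;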

Indexing Data in Solr 1.4 Enterprise Search Server: Part2

Packt
05 Oct 2009
9 min read
(For more resources on Solr, see here.)

Direct database and XML import

The capability for Solr to get data directly from a database or from HTTP GET accessible XML is distributed with Solr as a contrib module, known as the DataImportHandler (DIH for short). The complete reference documentation for this capability is at http://wiki.apache.org/solr/DataImportHandler, and it's rather thorough. In this article, we'll only walk through an example to see how it can be used with the MusicBrainz data set. In short, the DIH offers the following capabilities:

- Imports data from databases through JDBC (Java Database Connectivity)
- Imports XML data from a URL (HTTP GET) or a file
- Can combine data from different tables or sources in various ways
- Extraction/transformation of the data
- Import of updated (delta) data from a database, assuming a last-updated date
- A diagnostic/development web page
- Extensible to support alternative data sources and transformation steps

As the MusicBrainz data is in a database, the most direct method to get data into Solr is through the DIH using JDBC.

Getting started with DIH

DIH is not a direct part of Solr, so it might not be included in your Solr setup. It amounts to a JAR file named something like apache-solr-dataimporthandler-1.4.jar, which is probably already embedded within the solr.war file; you can use jar -tf solr.war to check. Alternatively, it may be placed in <solr-home>/lib, which is alongside the conf directory we've been working with. For database connectivity, we need to ensure that the JDBC driver is on the Java classpath; placing it in <solr-home>/lib is a convenient way to do this.

The DIH needs to be registered with Solr in solrconfig.xml. Here is how it is done:

<requestHandler name="/dataimport"
    class="org.apache.solr.handler.dataimport.DataImportHandler">
  <lst name="defaults">
    <str name="config">mb-dih-artists-jdbc.xml</str>
  </lst>
</requestHandler>

mb-dih-artists-jdbc.xml (mb being short for MusicBrainz) is a file in <solr-home>/conf, which is used to configure DIH. It is possible to specify some configuration aspects in this request handler configuration instead of the dedicated configuration file. However, I recommend that it all be kept in the DIH config file, as in our example here.
Given below is an mb-dih-artists-jdbc.xml file with a rather long SQL query:

<dataConfig>
  <dataSource name="jdbc" driver="org.postgresql.Driver"
    url="jdbc:postgresql://localhost/musicbrainz_db"
    user="musicbrainz" readOnly="true" autoCommit="false" />
  <document>
    <entity name="artist" dataSource="jdbc" pk="id" query="
      select a.id as id, a.name as a_name, a.sortname as a_name_sort,
        a.begindate as a_begin_date, a.enddate as a_end_date,
        a.type as a_type
        ,array_to_string(
          array(select aa.name from artistalias aa where aa.ref = a.id),
          '|') as a_alias
        ,array_to_string(
          array(select am.name from v_artist_members am
                where am.band = a.id order by am.id),
          '|') as a_member_name
        ,array_to_string(
          array(select am.id from v_artist_members am
                where am.band = a.id order by am.id),
          '|') as a_member_id,
        (select re.releasedate from release re inner join album r
          on re.album = r.id where r.artist = a.id
          order by releasedate desc limit 1) as a_release_date_latest
      from artist a
      "
      transformer="RegexTransformer,DateFormatTransformer,TemplateTransformer">
      <field column="id" template="Artist:${artist.id}" />
      <field column="type" template="Artist" />
      <field column="a_begin_date" dateTimeFormat="yyyy-MM-dd" />
      <field column="a_end_date" dateTimeFormat="yyyy-MM-dd" />
      <field column="a_alias" splitBy="|" />
      <field column="a_member_name" splitBy="|"/>
      <field column="a_member_id" splitBy="|" />
    </entity>
  </document>
</dataConfig>

The DIH development console

Before describing the configuration details, we're going to take a look at the DIH development console. It is accessed by going to this URL (modifications may be needed for your host, port, core, and so on): http://localhost:8983/solr/admin/dataimport.jsp

The screen is divided into two panes: on the left is the DIH control form, which includes an editable version of the DIH configuration file, and on the right is the command output as raw XML. The screen works quite simply: the form essentially results in submitting a URL to the right pane. There's no real server-side logic to this interface beyond the standard DIH command invocations being executed on the right. The last section on DIH in this article goes into more detail on submitting a command to the DIH.

DIH DataSources of type JdbcDataSource

The DIH configuration file starts with the declaration of one or more data sources using the element <dataSource/>, which refers to either a database, a file, or an HTTP URL, depending on the type attribute. It defaults to a value of JdbcDataSource. Those familiar with JDBC should find the driver and url attributes, with the accompanying user and password, straightforward; consult the documentation for your driver/database for further information. readOnly is a boolean that will set a variety of other JDBC options appropriately when set to true. batchSize is an alias for the JDBC fetchSize and defaults to 500. There are numerous other JDBC-oriented attributes that can be set as well. I would not normally recommend learning about a feature by reading source code, but this is an exception; for further information, read org.apache.solr.handler.dataimport.JdbcDataSource.java.

Efficient JDBC configuration

Many database drivers in their default configurations (including those for PostgreSQL and MySQL) fetch all of the query results into memory instead of fetching them on demand or using a batch/fetch size.
This may work well for typical database usage such as OLTP (Online Transaction Processing) systems, but it is completely unworkable for ETL (Extract, Transform, and Load) usage such as this. Configuring the driver to stream the data requires driver-specific configuration options; you may need to consult the relevant documentation for your JDBC driver. For PostgreSQL, set autoCommit to false. For MySQL, set batchSize to -1 (the DIH detects the -1 and replaces it with Java's Integer.MIN_VALUE, which triggers the special behavior in MySQL's JDBC driver). For Microsoft SQL Server, set responseBuffering to adaptive. Further information about specific databases is at http://wiki.apache.org/solr/DataImportHandlerFaq.

DIH documents, entities

After the declaration of the <dataSource/> element(s) comes the <document/> element, which in turn contains one or more <entity/> elements. In this sample configuration we're only getting artists; however, if we wanted to have more than one type in the same index, then another entity could be added. The dataSource attribute references a correspondingly named element declared earlier. It is only necessary if there are multiple data sources to choose from, but we've put it here explicitly anyway.

The main piece of an entity used with a JDBC data source is the query attribute, which is the SQL query to be evaluated. You'll notice that this query involves some sub-queries, which are made into arrays and then transformed into strings joined with a separator character (a pipe, in this example). The particular functions used to do these sorts of things are generally database specific. This is done to shoe-horn multi-valued data into a single row in the results. It may create a more complicated query, but it does mean that the database does all of the heavy lifting, so that all of the data Solr needs for an artist is in one row.

An alternative with DIH is to declare other entities within the entity. If you aren't using a database, or if you wish to mix in another data source (even if it's of a different type), then you will be forced to do that. See the Solr DIH wiki page for examples: http://wiki.apache.org/solr/DataImportHandler. The DIH also supports a delta query, which is a query that selects time-stamped data with dates after the last queried date. This won't be covered here, but you can find more information at the previous URL; a brief sketch of what such a query can look like appears at the end of this section.

DIH fields and transformers

Within the <entity/> are some <field/> elements that declare how the columns in the query map to Solr. The field element must have a column attribute that matches the corresponding named column in the SQL query. The name attribute is the Solr schema field name that the column is going into; if it is not specified (and it never is in our example), then it defaults to the column name. Use the SQL "as" keyword, as we've done, to use the same names as the Solr schema instead of the database schema. This reduces the number of explicit mappings needed in <field/> elements and shortens existing ones. When a column in the result can be placed directly into Solr without further processing, there is no need to specify the field declaration, because it is implied.

An attribute of the entity declaration that we didn't mention yet is transformer. This declares a comma-separated list of transformers that manipulate the transfer of data from the JDBC resultset into a Solr field. A transformer acts on a field when that field declares the attribute the transformer looks for, and more than one transformer might operate on a given field; therefore, the order in which the transformers are declared matters.
Here are the attributes we've used:

- template: Used by TemplateTransformer; it declares text which might include variable name substitutions using ${name} syntax. To access an existing field, use the entityname.columnname syntax.
- splitBy: Used by RegexTransformer; it splits a single string value into a multi-value by looking for the specified character.
- dateTimeFormat: Used by DateFormatTransformer. This is a Java date/time format pattern (http://java.sun.com/j2se/1.5.0/docs/api/java/text/SimpleDateFormat.html). If the type of the field in the schema is a date, then it is necessary to ensure Solr can interpret the format. Alternatively, ensure that the string matches the ISO-8601 format, which looks like this: 1976-10-23T23:59:59.000Z. As in all cases in Solr, when specifying dates you can use the so-called "DateMath" syntax, such as appending /DAY to tell Solr to round the date to a day.
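Relating back to the delta-import capability mentioned above, the following is a minimal sketch of the kind of SQL involved. It is not taken from the MusicBrainz configuration: it assumes a hypothetical last-updated column named lastupdate on the artist table, and the deltaQuery/deltaImportQuery attribute names and ${dataimporter...} variables follow the DIH wiki documentation referenced earlier, so verify them against the Solr version you are running.

-- Hypothetical deltaQuery for the artist entity: return the primary keys
-- of rows modified since the previous import ran.
select a.id
from artist a
where a.lastupdate > '${dataimporter.last_index_time}'

-- A matching deltaImportQuery then fetches the full row for each changed
-- primary key, referenced through ${dataimporter.delta.id}.
select a.id as id, a.name as a_name, a.sortname as a_name_sort
from artist a
where a.id = '${dataimporter.delta.id}'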

Faceting in Solr 1.4 Enterprise Search Server

Packt
01 Oct 2009
9 min read
(For more resources on Solr, see here.)

Faceting, after searching, is arguably the second-most valuable feature in Solr. It is perhaps even the most fun you'll have, because you will learn more about your data than with any other feature. Faceting enhances search results with aggregated information over all of the documents found in the search. Given a search on MusicBrainz releases, it can answer questions such as:

- How many are official, bootleg, or promotional?
- What were the top five most common countries in which the releases occurred?
- Over the past ten years, how many were released in each year?
- How many have names in these ranges: A-C, D-F, G-I, and so on?
- Given a track search, how many are under 2 minutes long, 2-3, 3-4, or more?

In addition, faceting can power term-suggest (aka auto-complete) functionality, which enables your search application to suggest a completed word that the user is typing, based on the most commonly occurring words starting with what they have already typed. So if a user started typing siamese dr, then Solr might suggest that dreams is the most likely word, along with other alternatives.

Faceting, sometimes referred to as faceted navigation, is usually used to power user interfaces that display this summary information with clickable links that apply Solr filter queries to a subsequent search. If we revisit the comparison of search technology to databases, then faceting is more or less analogous to SQL's group by feature on a column with count(*). However, in Solr, facet processing is performed subsequent to an existing search as part of a single request-response, with both the primary search results and the faceting results coming back together. In SQL, you would potentially need to perform a series of separate queries to get the same information (a SQL sketch of this analogy appears near the end of this excerpt).

A quick example: Faceting release types

Observe the following search results. echoParams is set to explicit (defined in solrconfig.xml) so that the search parameters are seen here. This example is using the standard handler (though perhaps dismax is more typical). The query parameter q is *:*, which matches all documents. In this case, the index I'm using only has releases; if there were non-releases in the index, then I would add a filter fq=type%3ARelease to the URL or put this in the handler configuration, as that is the data set we'll be using for most of this article. I wanted to keep this example brief, so I set rows to 2. Sometimes when using faceting you only want the facet information and not the main search, in which case you would set rows to 0. It's important to understand that the faceting numbers are computed over the entire search result, which is all of the releases in this example, and not just the two rows being returned.
<?xml version="1.0" encoding="UTF-8"?>
<response>
  <lst name="responseHeader">
    <int name="status">0</int>
    <int name="QTime">160</int>
    <lst name="params">
      <str name="wt">standard</str>
      <str name="rows">2</str>
      <str name="facet">true</str>
      <str name="q">*:*</str>
      <str name="fl">*,score</str>
      <str name="qt">standard</str>
      <str name="facet.field">r_official</str>
      <str name="f.r_official.facet.missing">true</str>
      <str name="f.r_official.facet.method">enum</str>
      <str name="indent">on</str>
    </lst>
  </lst>
  <result name="response" numFound="603090" start="0" maxScore="1.0">
    <doc>
      <float name="score">1.0</float>
      <str name="id">Release:136192</str>
      <str name="r_a_id">3143</str>
      <str name="r_a_name">Janis Joplin</str>
      <arr name="r_attributes"><int>0</int><int>9</int><int>100</int></arr>
      <str name="r_name">Texas International Pop Festival 11-30-69</str>
      <int name="r_tracks">7</int>
      <str name="type">Release</str>
    </doc>
    <doc>
      <float name="score">1.0</float>
      <str name="id">Release:133202</str>
      <str name="r_a_id">6774</str>
      <str name="r_a_name">The Dubliners</str>
      <arr name="r_attributes"><int>0</int></arr>
      <str name="r_lang">English</str>
      <str name="r_name">40 Jahre</str>
      <int name="r_tracks">20</int>
      <str name="type">Release</str>
    </doc>
  </result>
  <lst name="facet_counts">
    <lst name="facet_queries"/>
    <lst name="facet_fields">
      <lst name="r_official">
        <int name="Official">519168</int>
        <int name="Bootleg">19559</int>
        <int name="Promotion">16562</int>
        <int name="Pseudo-Release">2819</int>
        <int>44982</int>
      </lst>
    </lst>
    <lst name="facet_dates"/>
  </lst>
</response>

The facet-related search parameters are highlighted at the top. The facet.missing parameter was set using the field-specific syntax, which will be explained shortly. Notice that the facet results (highlighted) follow the main search result and are given the name facet_counts. In this example we only faceted on one field, r_official, but you'll learn in a bit that you can facet on as many fields as you desire. The name attribute holds a facet value, which is simply an indexed term, and the integer following it is the number of documents in the search results containing that term, aka a facet count. The next section explains where r_official and r_type came from.

MusicBrainz schema changes

In order to get better, self-explanatory faceting results out of the r_attributes field, and to split its dual meaning, I modified the schema and added some text analysis. r_attributes is an array of numeric constants that signify various types of releases and their official-ness, for lack of a better word. As it represents two different things, I created two new fields, r_type and r_official, with copyField directives to copy r_attributes into them:

<field name="r_attributes" type="integer" multiValued="true"
       indexed="false" /><!-- ex: 0, 1, 100 -->
<field name="r_type" type="rType" multiValued="true"
       stored="false" /><!-- Album | Single | EP | ... etc. -->
<field name="r_official" type="rOfficial" multiValued="true"
       stored="false" /><!-- Official | Bootleg | Promotional -->

And:

<copyField source="r_attributes" dest="r_type" />
<copyField source="r_attributes" dest="r_official" />

In order to map the constants to human-readable definitions, I created two field types, rType and rOfficial, that use a regular expression to pull out the desired numbers and a synonym list to map from the constant to the human-readable definition. Conveniently, the constants for r_type are in the range 1-11, whereas those for r_official are 100-103.
I removed the constant 0, as it seemed to be bogus.

<fieldType name="rType" class="solr.TextField" sortMissingLast="true"
           omitNorms="true">
  <analyzer>
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.PatternReplaceFilterFactory"
            pattern="^(0|1\d\d)$" replacement="" replace="first" />
    <filter class="solr.LengthFilterFactory" min="1" max="100" />
    <filter class="solr.SynonymFilterFactory" synonyms="mb_attributes.txt"
            ignoreCase="false" expand="false"/>
  </analyzer>
</fieldType>

The definition of the type rOfficial is the same as rType, except that it has this regular expression: ^(0|\d\d?)$. The presence of LengthFilterFactory is to ensure that no zero-length (empty-string) terms get indexed; otherwise this would happen, because the previous regular expression reduces text fitting unwanted patterns to empty strings. The content of mb_attributes.txt is as follows:

# from: http://bugs.musicbrainz.org/browser/mb_server/trunk/
#       cgi-bin/MusicBrainz/Server/Release.pm#L48
# note: non-album track seems bogus; almost everything has it
0=>Non-Album Track
1=>Album
2=>Single
3=>EP
4=>Compilation
5=>Soundtrack
6=>Spokenword
7=>Interview
8=>Audiobook
9=>Live
10=>Remix
11=>Other
100=>Official
101=>Promotion
102=>Bootleg
103=>Pseudo-Release

It does not matter whether the user interface uses the name (for example, Official) or the constant (for example, 100) when applying filter queries in faceted navigation, as the text analysis will let the names through and will map the constants to the names. This is not necessarily true in the general case, but it is for the text analysis as I've configured it above. The approach I took was relatively simple, but it is not the only way to do it. Alternatively, I might have split the attributes and/or mapped them as part of the import process; this would allow me to remove the multiValued setting on r_official. Moreover, it wasn't truly necessary to map the numbers to their names, as the user interface that presents the data could very well map them on the fly.

Field requirements

The principal requirement of a field that will be faceted on is that it must be indexed. In addition, for all but the prefix faceting use case, you will also want to use text analysis that does not tokenize the text. For example, the value Non-Album Track is indexed as-is in r_type; we need to be careful to escape the space where this appears in mb_attributes.txt, otherwise faceting on this field would show tallies for Non-Album and Track separately. Depending on the type of faceting you want to do, and other needs you may have such as sorting, you will often find it necessary to have a copy of a field just for faceting. Remember that with faceting, the facet values returned in search results are the actual terms indexed, and not the stored value, which isn't even used.
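To make the earlier SQL analogy concrete, here is roughly what the r_official facet counts correspond to as a group-by query. The release table and official_status column names are hypothetical, used only for illustration (they are not the MusicBrainz schema discussed above), and unlike Solr's single request-response, you would still need a separate query for the main search results.

-- Roughly the SQL equivalent of the r_official facet:
-- one row per facet value with the number of matching documents.
SELECT official_status, COUNT(*) AS facet_count
FROM   release
WHERE  1 = 1                -- the search criteria (q and fq) would go here
GROUP  BY official_status
ORDER  BY facet_count DESC;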

Adding Reporting Capabilities to our Java Applications Using JasperReports 3.5

Packt
01 Oct 2009
12 min read
At the end of this article, we will be able to:

- Identify the purpose of the several downloads that can be found at the JasperReports web site
- Set up our development and execution environment to successfully add reporting capabilities to our Java applications
- Identify the required libraries for adding reporting capabilities to Java applications
- Identify the optional libraries that can be used to enhance the reporting capabilities of our Java applications

Downloading JasperReports

JasperReports is distributed as a JAR file that needs to be added to the CLASSPATH of any application we wish to add reporting capabilities to. JasperReports can be downloaded from http://jasperforge.org/plugins/project/project_home.php?group_id=102. Clicking on the Download link around the center of the page will take us to the JasperReports download page on SourceForge.net. The specific version on your screen might be different; at the time of writing, the latest stable version of JasperReports is 3.5.2. It is not always clear what exactly is downloaded by clicking on these links, so we will provide a brief explanation of what each link is for:

- jasperreports-3.5.2-applet.jar: This first download link is for a JAR file containing a subset of the JasperReports functionality. Specifically, it contains classes that can be used to display JasperPrint files, which are reports in JasperReports' native format. This file is offered as a separate download as a convenience for developers; it can be used for applications or applets that don't need full reporting capabilities, yet need to display generated reports. Even though the filename has a suffix of "applet", there is nothing preventing us from using it with standalone applications, and it avoids the overhead that the missing JasperReports classes would add to the download. This file is approximately 346 KB in size.
- jasperreports-3.5.2.jar: This is the second download link, and it is the complete JasperReports class library. It contains all classes necessary to compile, fill, and export reports, but does not include any additional libraries that JasperReports depends on. This is the minimum requirement for adding full reporting capabilities to our Java applications; however, if we choose to download this file, we need to download the JasperReports dependencies separately. This file is approximately 2.2 MB in size.
- jasperreports-3.5.2-javaflow.jar: This is the third download link, and it is the javaflow version of JasperReports. This version of JasperReports should be used in environments where multithreading is discouraged. This file is approximately 2.2 MB in size.
- jasperreports-3.5.2-project.tar.gz: This is the fourth download link, and it contains the complete JasperReports class library plus all the required and optional libraries, packaged as a gzipped TAR file, which is common in Unix and Unix-like systems such as Linux. This download also includes the JasperReports source code and a lot of example source code demonstrating JasperReports' functionality. This gzipped TAR file is approximately 42 MB in size.
- jasperreports-3.5.2-project.zip: The fifth download link, like the fourth, contains the complete JasperReports class library plus all the required and optional libraries, along with the JasperReports source code. However, it contains the files in ZIP format, which is more commonly used under Microsoft Windows. This file is approximately 51 MB in size.
Unless Internet connection speed is an issue, we recommend downloading one of the last two files mentioned, as they include everything we need to create reports with JasperReports. Another good reason to download one of these files is that the included examples are a great way to learn how to implement the different JasperReports features. All of the examples in the file come with an ANT build file containing targets to compile and execute them. We will refer to this file as the JasperReports project file or, more succinctly, as the project file. Once we have downloaded the appropriate file for our purposes, we need to set up our environment to be able to start creating reports. In the next section, we discuss how to do this, assuming that the project file was downloaded.

Setting up our environment

To set up our environment and get ready for creating reports, we need to extract the JasperReports project ZIP file to a location of our choice. Once we extract the project ZIP file, we should see a jasperreports-3.5.2 directory (the actual name of the directory may vary slightly depending on the version of JasperReports) containing the following files and directories:

- build: This directory contains the compiled JasperReports class files.
- build.xml: This is an ANT build file, which builds the JasperReports source code. If we don't intend to modify JasperReports, we don't need to use this file, as JasperReports is distributed in compiled form.
- changes.txt: This file explains the differences between the current and previous versions of the JasperReports class library.
- demo: This directory contains various examples demonstrating several aspects of JasperReports functionality.
- dist: This directory contains JAR files for the standard, javaflow, and applet versions of the JasperReports library. We should add one of these JAR files to our CLASSPATH to take advantage of JasperReports functionality.
- docs: This directory contains a quick reference guide to most XML tags used in JasperReports templates.
- lib: This directory contains all the libraries needed to build JasperReports and to use it in our applications.
- license.txt: This file contains the full text of the LGPL license.
- pom.xml: This is a Maven 2 POM file used to build JasperReports with Maven, just like build.xml. We don't need this file, because JasperReports is distributed in compiled form.
- readme.txt: This file contains instructions on how to build and execute the supplied examples.
- src: This directory contains the JasperReports source code.

Getting up and running quickly

To get up and running quickly, the files to add to the CLASSPATH are the JasperReports JAR files and all the JAR files under the lib directory in the project ZIP file. By adding these files to the CLASSPATH, we don't have to worry about the CLASSPATH when implementing additional functionality, for example when exporting to PDF or producing charts.

JasperReports class library

For all JasperReports-related tasks, we need to add the JasperReports library to our CLASSPATH. The JasperReports library can be found under the dist directory in the project file; it is named jasperreports-3.5.2.jar. Depending on the version of JasperReports, the filename will vary.

Required libraries for report compilation

The project file described earlier contains all of the required supporting libraries. Once that file is downloaded, all the required libraries can be found under the lib subdirectory of the directory created when extracting the ZIP file.
JasperReports uses these required files for XML parsing; therefore, they are needed when compiling JRXML files programmatically, but not for filling or displaying reports. JRXML files can be compiled using a custom ANT task provided by JasperReports. If we choose to compile our JRXML files using this custom ANT target, these required libraries need to be added to the CLASSPATH variable of the ANT build file. There are example build files included in the project file and also at http://www.packtpub.com/files/code/8082_Code.zip. The following discussion of the libraries highlights the fact that JasperReports makes extensive use of Apache Commons.

Apache Commons Digester

The Commons Digester library includes utility classes used to initialize Java objects from XML files. JasperReports takes advantage of the Digester component of the Apache Commons repository to implement its XML parsing functionality. Version 3.5.2 of the JasperReports project ZIP file includes version 1.7 of Commons Digester. The filename is commons-digester-1.7.jar, and it must be on your CLASSPATH for your JasperReports application to work correctly. If you downloaded the bare JasperReports class library, you will need to download Commons Digester separately from http://commons.apache.org/digester/.

Apache Commons Collections

Another component of the Apache Commons suite is the Collections component. This component provides functionality to complement and augment the Java Collections framework, and JasperReports takes advantage of it to implement some of its functionality. Like all required libraries included in the project ZIP file, the Commons Collections library can be found under the lib subdirectory of the directory created when extracting the project ZIP file. JasperReports project file version 3.5.2 includes version 2.1 of Commons Collections, distributed as a JAR file named commons-collections-2.1.jar. If you have downloaded the bare JasperReports class library, you will need to download Commons Collections separately from http://commons.apache.org/collections/.

Apache Commons Logging

Apache Commons Logging is a component of Apache Commons that provides components to aid developers with sending data to log files. JasperReports takes advantage of this component, which can be found in the lib directory of the project ZIP file. The version included with JasperReports 3.5.2 is Commons Logging 1.0.2; the file to be added to your CLASSPATH is commons-logging-1.0.2.jar. If you have downloaded the bare JasperReports class library, you will need to download Commons Logging separately from http://commons.apache.org/logging/.

Apache Commons BeanUtils

The last library that JasperReports requires for compiling reports is Apache Commons BeanUtils. BeanUtils is a library that provides easy-to-use wrappers around the Java reflection and introspection APIs. Version 3.5.2 of the JasperReports project ZIP file includes BeanUtils 1.8; the file to add to your CLASSPATH is commons-beanutils-1.8.jar. If you have downloaded the bare JasperReports class library, you will need to download Commons BeanUtils separately from http://commons.apache.org/beanutils/.

Optional libraries and tools

There are a number of libraries that are required only if we wish to take advantage of some of JasperReports' features. These optional libraries and their uses are listed next.
Apache ANT

JasperReports comes bundled with some custom ANT targets for previewing report designs and for viewing reports serialized in JasperReports' native format. Although not strictly necessary, it is very helpful to have ANT available to take advantage of these custom targets. ANT can be downloaded from http://ant.apache.org/.

JDT compiler

The JDT (Java Development Tools) compiler is the Java compiler included with the Eclipse IDE. The JDT compiler is needed only if the JasperReports application is running under a Java Runtime Environment (JRE) and not under a full JDK. When creating reports, JasperReports creates temporary Java files and compiles them; when using a JDK, JasperReports takes advantage of tools.jar for this functionality. As a JRE does not include tools.jar, the JDT compiler is needed. The JasperReports project file version 3.5.2 includes version 3.1.1 of the JDT compiler. It can be found under the lib directory of the directory created when extracting the project ZIP file; the file to add to your CLASSPATH is jdt-compiler-3.1.1.jar. This file cannot be downloaded separately; therefore, if we need to execute our code under a JRE, we need to download the JasperReports project ZIP file, because it includes this file needed for report compilation.

JDBC driver

When using a JDBC datasource, which is the most common datasource for JasperReports-generated reports, the appropriate JDBC driver for our specific RDBMS is needed. The following list shows popular relational database systems and the required JAR files to add to the CLASSPATH. The exact filenames may vary depending on the version, target JDK, and supported JDBC version; the filenames shown here reflect the latest stable versions targeted to the latest available JDK and the latest available version of JDBC at the time of writing.

- Firebird: jaybird-2.1.6.jar
- HSQLDB: hsqldb.jar
- JavaDB/Derby (included with JDK 1.6+): derby.jar (embedded), derbyclient.jar (network)
- MySQL: mysql-connector-java-5.1.7-bin.jar
- Oracle: ojdbc6.jar
- PostgreSQL: postgresql-8.3-604.jdbc4.jar
- SQL Server: sqljdbc_1.2.2828.100_enu.exe (for Windows systems), sqljdbc_1.2.2828.100_enu.tar.gz (for Unix systems)
- Sybase: Jconnect60.zip

The JasperReports project file includes the JDBC driver for HSQLDB. Consult your RDBMS documentation for information on where to download the appropriate JDBC driver for your RDBMS.

iText

iText is an open source library for creating and manipulating PDF files. It is needed in our CLASSPATH only if we want to export our reports to PDF or RTF format. Version 3.5.2 of the JasperReports project file includes iText version 2.1.0; the file to add to your CLASSPATH is iText-2.1.0.jar. The iText library can be downloaded separately from http://www.lowagie.com/iText/.

JFreeChart

JFreeChart is an open source library for creating professional-looking charts, including 2D and 3D pie charts, 2D and 3D bar charts, and line charts. It is needed in our CLASSPATH only if we intend to add charts to our reports. JFreeChart 1.0.12 can be found in the lib directory inside the JasperReports project file version 3.5.2; the file to add to the CLASSPATH is jfreechart-1.0.12.jar. JFreeChart can be downloaded separately from http://www.jfree.org/jfreechart/.

JExcelApi

JExcelApi is a Java library that allows Java applications to read, write, and modify Microsoft Excel files. We need JExcelApi in our CLASSPATH only if we need to export our reports to XLS format. JasperReports 3.5.2 includes JExcelApi version 2.6.
To add XLS exporting capabilities to our reports, the file we need to add to our CLASSPATH is jxl-2.6.jar. JExcelApi can be downloaded separately from http://jexcelapi.sourceforge.net/.

Summary

This article covered the required and optional libraries needed to add reporting capabilities to Java applications. All the libraries covered in this article are needed at both compile time and runtime. The article explained the different files available for download on the JasperReports web site and the conditions under which it is appropriate to use each of them. We also saw which libraries are required for report compilation under a JDK, and the additional libraries required when compiling JRXML templates under a JRE. We also learned which libraries are required when using JDBC datasources for our reports, and finally, the libraries required when exporting our reports to several formats.

Designing the Target Structure in Oracle Warehouse Builder 11g

Packt
01 Oct 2009
6 min read
We have our entire source structures defined in the Warehouse Builder, but before we can do anything with them, we need to design what our target data warehouse structure is going to look like. When we have that figured out, we can start mapping data from the source to the target. So, let's design our target structure. First, we're going to take a look at some design topics related to a data warehouse that are different from what we would use if we were designing a regular relational database.

Data Warehouse Design

When it comes to the design of a data warehouse, there is basically one option that makes the most sense for how we will structure our database, and that is the dimensional model. This is a way of looking at the data from a business perspective that makes the data simple, understandable, and easy to query for the business end user. It doesn't require a database administrator to be able to retrieve data from it.

We know the normalized method of modelling a database. A normalized model removes redundancies in data by storing information in discrete tables, and then referencing those tables when needed. This has an advantage for a transactional system, because information needs to be entered at only one place in the database, without duplicating any information already entered. For example, in the ACME Toys and Gizmos transactional database, each time a transaction is recorded for the sale of an item at a register, a record needs to be added only to the transactions table. The details identifying the register, the item, and the employee who processed the transaction do not need to be entered, because that information is already stored in separate tables; the main transaction record just needs to be entered with references to all that other information.

This works extremely well for a transactional type of system concerned with daily operational processing, where the focus is on getting data into the system. However, it does not work well for a data warehouse, whose focus is on getting data out of the system. Users do not want to navigate through the spider web of tables that compose a normalized database model to extract the information they need. Therefore, dimensional models were introduced to provide the end user with a flattened structure of easily queried tables that he or she can understand from a business perspective.

Dimensional Design

A dimensional model takes the business rules of our organization and represents them in the database in a more understandable way. A business manager looking at sales data is naturally going to think along the lines of "how many gizmos did I sell last month in all stores in the south, and how does that compare to how many I sold in the same month last year?" Managers just want to know what the result is, and don't want to worry about how many tables need to be joined in a complex query to get that result. A dimensional model removes the complexity and represents the data in a way that end users can relate to more easily from a business perspective. Users can intuitively think of the data for the above question as a cube, with the edges (or dimensions) of the cube labeled as stores, products, and time frame. So let's take a look at this concept of a cube with dimensions, and how we can use it to represent our data.
Cube and Dimensions

The dimensions become the business characteristics about the sales, for example:

- A time dimension: users can look back in time and check various time periods
- A store dimension: information can be retrieved by store and location
- A product dimension: the various products for sale can be broken out

Think of the dimensions as the edges of a cube, and the intersection of the dimensions as the measure we are interested in for that particular combination of time, store, and product. A picture is worth a thousand words, so let's look at what we're talking about in the following image. Notice what this cube looks like. How about a Rubik's Cube? We're doing a data warehouse for a toy store company, so we ought to know what a Rubik's Cube is! If you have one, maybe you should go get it now, because it models exactly what we're talking about.

Think of the width of the cube, or a row going across, as the product dimension. Every piece of information or measure in the same row refers to the same product, so there are as many rows in the cube as there are products. Think of the height of the cube, or a column going up and down, as the store dimension. Every piece of information in a column represents one single store, so there are as many columns as there are stores. Finally, think of the depth of the cube as the time dimension, so any pieces of information in the rows and columns at the same depth represent the same point in time. The intersection of each of these three dimensions locates a single, individual cube within the big cube, and that represents the measure amount we're interested in. In this case, it's dollar sales for a single product, in a single store, at a single point in time.

One might wonder whether we are restricted to just three dimensions with this model. After all, a cube has only three dimensions: length, width, and depth. Well, the answer is no. We can have many more dimensions than just three. In our ACME example, we might want to know the sales each employee has accomplished for the day; this would mean we would need a fourth dimension for employees. But what about our visualization above using a cube? How is this fourth dimension going to be modelled? And no, the answer is not that we're entering the Twilight Zone here with that "dimension not only of sight and sound but of mind..."

We can think of additional dimensions as being cubes within a cube. If we think of an individual intersection of the three dimensions of the cube as being another cube, we can see that we've just opened up another three dimensions to use: the three for that inner cube. The Rubik's Cube example used above is good because it is literally a cube of cubes and illustrates exactly what we're talking about. We do not need to model additional cubes; the concept of cubes within cubes was just a way to visualize further dimensions. We just model our main cube, add as many dimensions as we need to describe the measures, and leave it to the implementation to handle. This is a very intuitive way for users to look at the design of the data warehouse, and when it's implemented in a database, it becomes easy for users to query the information from it.
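A minimal relational sketch of this cube, in star-schema form, may help tie the picture back to actual tables and queries. The table and column names below (sales_fact, store_dim, product_dim, date_dim, and their keys) are illustrative assumptions only, not the objects the Warehouse Builder will generate for the ACME project.

-- The cube as a fact table whose foreign keys are the dimension edges.
CREATE TABLE sales_fact (
  date_dim_key     NUMBER NOT NULL,   -- time edge of the cube
  store_dim_key    NUMBER NOT NULL,   -- store edge of the cube
  product_dim_key  NUMBER NOT NULL,   -- product edge of the cube
  sale_amount      NUMBER             -- the measure at that intersection
);

-- The business manager's question: gizmo sales for one month, by store region.
SELECT s.region_name, SUM(f.sale_amount) AS total_sales
FROM   sales_fact  f
JOIN   product_dim p ON p.dimension_key = f.product_dim_key
JOIN   store_dim   s ON s.dimension_key = f.store_dim_key
JOIN   date_dim    d ON d.dimension_key = f.date_dim_key
WHERE  p.product_name  = 'Gizmo'
AND    d.cal_month_code = 200909
GROUP  BY s.region_name;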

Synchronizing Objects in Oracle Warehouse Builder

Packt
29 Sep 2009
5 min read
Synchronizing objects

We created tables, dimensions, and a cube, and new tables were automatically created for each dimension and the cube. We then created mappings to map data from tables to tables, dimensions, and a cube. What happens if, for example, a table definition is updated after we've defined it and created a mapping or mappings that include it? What if a dimensional object is changed? In that case, what happens to the underlying table? This is what we are going to discuss in this section.

One set of changes that we'll frequently find ourselves making is changes to the data we've defined for our data warehouse. We may get some new requirements that lead us to capture a new data element that we have not captured yet. We'll need to update our staging table to store it and our staging mapping to load it, and our dimension mapping(s) will need to be updated to store the new data element along with the underlying table. We could make manual edits to all the affected objects in our project, but the Warehouse Builder provides some features to make that easier.

Changes to tables

Let's start the discussion by looking at table updates. If we have a new data element that needs to be captured, it will mean finding out where that data resides in our source system and updating the associated table definition in our module for that source system.

Updating object definitions

There are a couple of ways to update table definitions. Our choice will depend on how the table was defined in the Warehouse Builder in the first place. The two options are:

- It could be a table in a source database system, in which case the table was physically created in the source database and we just imported the table definition into the Warehouse Builder.
- It could be a table we defined in our project in the Warehouse Builder and then deployed to the target database to create it. Our staging table would be an example of this second option.

In the first case, we can re-import the source table using the procedures generally used for importing source metadata. When re-importing tables, the Warehouse Builder performs a reconciliation process to update the already imported table with any changes it detects in the source table. For the second case, we can manually edit the table definition in our project to reflect the new data element.

For the first case, where the table is in a source database system, the action we choose also depends on whether that source table definition is in an Oracle database or a third-party database. If it is in a third-party database, we're going to encounter an error, and we'll be forced to make manual edits to our metadata for that source until that bug is fixed. If the table is in an Oracle database, re-importing the table definition is not a problem, and the reconciliation process will pick up any new data elements or changes to the existing ones.

For a hands-on example here, let's turn to the new project that we created earlier while discussing snapshots. We copied our POS_TRANS_STAGE table over to this project, so let's use that table as an example of a changing table, as we defined the table structure manually in the Warehouse Builder Design Center and then deployed it to the target database to actually create it. For this example, we won't actually re-deploy it, because we'll be using that second project we created. It doesn't have a valid location defined, but we can still edit the table definition and investigate how to reconcile that edit in the next section.
So, let's edit the POS_TRANS_STAGE table in the ACME_PROJ_FOR_COPYING project in the Design Center by double-clicking on it to launch it in the Data Object Editor. We'll add a column called STORE_AREA_SIZE to the table for storing the size of the store in square feet or square meters. We'll click on the Columns tab, scroll all the way to the end, enter the name of the column, select NUMBER for the data type, and leave the precision and scale at the default (that is, 0) for this example. We can validate and generate the object without having a valid location defined, so we'll do that. The validation and generation should complete successfully, and if we look at the script, we'll see the new column included.

We now need a mapping that uses that table, which we have back in our original project. Let's use the copy-and-paste technique we used earlier to copy the STAGE_MAP mapping over to this new project. We'll open the ACME_DW_PROJECT project, answering Save to the prompt to save or revert. Then, on the STAGE_MAP mapping entry, we'll select Copy from the pop-up menu. We'll open the ACME_PROJ_FOR_COPYING project and then, on the Mappings node, select Paste from the pop-up menu. We ordinarily wouldn't copy an object and paste it into a whole new project just for making changes; we're only doing it here so that we can make changes without worrying about interfering with a working project.
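For reference, the generated script mentioned above boils down to ordinary DDL against the staging table. The statements below are a hypothetical sketch of what adding the new column amounts to in plain SQL, not the exact code the Warehouse Builder generates.

-- Adding the new column by hand would look roughly like this
-- (OWB's generated script creates or alters the table for us).
ALTER TABLE pos_trans_stage
  ADD (store_area_size NUMBER);

-- Quick check that the column is now part of the staging table.
SELECT column_name, data_type
FROM   user_tab_columns
WHERE  table_name = 'POS_TRANS_STAGE';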

Mapping in Oracle Warehouse Database

Packt
29 Sep 2009
5 min read
In this article, we will begin to see the real power and flexibility the Warehouse Builder provides us for loading a data warehouse. When we complete the mappings in this article, we will have a complete collection of objects and mappings that we can deploy and run to build and load our data warehouse. The basic procedure to build a mapping is the same: start by adding a source and a target, and then include any operators in between that are needed for data flow and transformation. Let's start this article with the STORE dimension, and we'll see some new operators that are involved in transformations.

Let's begin by creating a new mapping called STORE_MAP. In the Design Center, we will right-click on the Mappings node of the ACME_DW_PROJECT | Databases | Oracle | ACME_DWH database and select New.... We'll enter STORE_MAP for the name of the mapping and will be presented with a blank Mapping Editor window. In this window, we will begin designing our mapping to load data into the STORE dimension.

Adding source and target operators

Once the POS_TRANS_STAGE staging table is loaded with data, it can be used to load data into our dimensions and cube. We'll now use this POS_TRANS_STAGE table as our source table, so let's drag it onto the mapping from the Explorer window. The target for this mapping is going to be the STORE dimension, so we'll drag this dimension from Databases | Oracle | ACME_DWH | Dimensions onto the mapping and drop it to the right of the POS_TRANS_STAGE table operator. Remember that we build our mappings from left to right, with the source on the left and the target on the right. We'll be sure to leave some space between the two, because we'll be filling that in with some more operators as we proceed.

Now that we have our source and target included, let's take a moment to consider the data elements we're going to need for our target and where to get them from the source. Our target for this mapping, the STORE dimension, has the following attributes for the STORE level, for which we'll need source data:

- NAME
- STORE_NUMBER
- ADDRESS1
- ADDRESS2
- CITY
- STATE
- ZIP_POSTALCODE
- COUNTY
- REGION_NAME

For the REGION level, we'll need data for the following attributes:

- NAME
- DESCRIPTION
- COUNTRY_NAME

For the COUNTRY level, we'll need data for the following attributes:

- NAME
- DESCRIPTION

(A screenshot in the original article shows the complete and fully expanded STORE dimension in the mapping at this point.)

We might be tempted to include the ID fields in the above list of data elements to populate, but these are attributes that will be filled in automatically by the Warehouse Builder, using the sequence that was automatically created for us when we built the dimension. We don't have to be concerned with connecting any source data to them.

Now that we know what we need to populate in our STORE dimension, let's turn our attention to the POS_TRANS_STAGE table for the candidate data elements we can use. In this table, we have the following data elements for populating data in our STORE dimension:

- STORE_NAME
- STORE_NUMBER
- STORE_ADDRESS1
- STORE_ADDRESS2
- STORE_CITY
- STORE_STATE
- STORE_ZIPPOSTALCODE
- STORE_REGION
- STORE_COUNTRY

It is easy to see which of these attributes will be used to map data to attributes in the STORE level of the STORE dimension; they will map into the corresponding attributes of the STORE group in the dimension.
We'll need to connect the following attributes together:

- STORE_NAME to NAME
- STORE_NUMBER to STORE_NUMBER
- STORE_ADDRESS1 to ADDRESS1
- STORE_ADDRESS2 to ADDRESS2
- STORE_CITY to CITY
- STORE_STATE to STATE
- STORE_ZIPPOSTALCODE to ZIP_POSTALCODE
- STORE_REGION to REGION_NAME

There is another attribute in our STORE dimension that we haven't accounted for yet: the COUNTY attribute. We don't have an input attribute that provides direct information for it. It is a special case that we will handle after we take care of these more straightforward attributes, and it will involve the lookup table that we discussed earlier in the introduction of this article.

We're not going to connect the attributes mentioned in the list directly by just dragging a line between each of them. There are some issues with the source data that we are going to have to account for in our mapping. Connecting the attributes directly like that would mean the data would be loaded into the dimension as-is, but we have investigated the source data and discovered that much of it contains trailing blanks, due to the way the transactional system stores it, and that some of the fields should be made all uppercase for consistency. Given this additional information, we'll summarize the issues with each of the fields that need to be corrected before loading into the target, and then we'll see how to implement the necessary transformations in the mapping to correct them:

- STORE_NAME and STORE_NUMBER: We need to trim spaces and change these attributes to uppercase to facilitate queries, as they are part of the business identifier.
- STORE_ADDRESS1, ADDRESS2, CITY, STATE, and ZIP_POSTALCODE: We need to trim spaces for all of these and change the STATE attribute to uppercase.
- STORE_REGION: We need to trim spaces and change this attribute to uppercase.

All of these needs can be satisfied by applying pre-existing SQL functions to the data via Transformation Operators, as sketched below.
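The Transformation Operators in the mapping wrap standard Oracle SQL functions, so the corrections above amount to the following kind of expression. This is only a sketch of the equivalent SQL run directly against the staging table, showing what TRIM and UPPER do to the incoming values; it is not the code the Warehouse Builder generates for the mapping.

-- What the transformation operators amount to in plain SQL:
-- strip leading/trailing blanks and uppercase the business identifier
-- and state/region fields before they reach the STORE dimension.
SELECT UPPER(TRIM(store_name))   AS name,
       UPPER(TRIM(store_number)) AS store_number,
       TRIM(store_address1)      AS address1,
       TRIM(store_address2)      AS address2,
       TRIM(store_city)          AS city,
       UPPER(TRIM(store_state))  AS state,
       TRIM(store_zippostalcode) AS zip_postalcode,
       UPPER(TRIM(store_region)) AS region_name
FROM   pos_trans_stage;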

Configuring a MySQL linked server on SQL Server 2008

Packt Editorial Staff
16 Aug 2009
7 min read
Linking servers provides an elegant solution when you are faced with running queries against databases on distributed servers, or with looking at your distributed assets in disparate databases. This article by Dr. Jay Krishnaswamy explains how to set up a MySQL linked server on SQL Server 2008 Enterprise. Configuring the linked MySQL server, as well as querying a table on it, is described. The reader would benefit from reviewing the first article in this series on MySQL servers.

Introduction

MS SQL Server has always provided remote access to servers through RPC mechanisms, but these had the limitation of being confined to invoking procedures on remote locations. A linked server (a virtual server) may be considered a more flexible way of achieving the same thing, with the added benefits of remote table access and distributed queries. Microsoft manages the link mechanism via OLE DB technology: an OLE DB data source points to the specific database that can be accessed over OLE DB.

In this article, we will be creating a MySQL linked server on SQL Server 2008 and querying a table in the database [TestMove], shown in the next listing. As noted in the previous article, the employees table was moved to the MySQL database TestMove. In running the commands from the mysql> prompt, it is assumed that the MySQL server has been started.

Listing 1: employees table in TestMove

mysql> show tables;
+--------------------+
| Tables_in_testmove |
+--------------------+
| employees          |
+--------------------+
1 row in set (0.09 sec)

mysql>

Creating an ODBC DSN for MySQL

In the previously cited article on MySQL servers, a DSN was created for moving data, and essentially the same DSN can be used here. What follows is a brief review of the DSN MySQL_Link, created along the same lines as in that article. The ODBC driver used for creating this DSN is the one installed on the machine when the MySQL server was installed, as shown. The final interactive window, where you may test the connectivity, is shown in the next figure. You may notice that the database TestMove has been named in the ODBC DSN, and that MySQL_Link is the name of the DSN. When you close the window after clicking the OK button, an ODBC DSN item will be added to the System DSN tab of the ODBC wizard, as shown.

Steps to create a MySQL linked server from Management Studio

Right-click the Linked Servers node to display a drop-down menu, as shown in the next figure. Click on the New Linked Server... item. This brings up the New Linked Server window, as shown. The window is empty except for a default provider. The very first thing to do is to provide a name for this linked server; herein it is LINKED_MYSQL. We will be using one of the providers supported by SQL Server 2008, the Microsoft OLE DB Provider for ODBC (MSDASQL). You may access the list of OLE DB providers in the Providers sub-folder of the Linked Servers node. Provide the other details as shown in the next figure, and make sure they are entered exactly as shown, or according to how you have created the database on the MySQL server.

Click on the Security list item under General, to the left. This opens the Security page of the New Linked Server wizard, as shown. Change the login from the default "Be made without using a security context" to "Be made using this security context", and enter the remote login and password. In this case, the remote login is "root" and the password is the one used when the ODBC DSN was created (it is also the password for server authentication).
Make no changes to the Server Options page and click OK. This creates the linked server LINKED_MYSQL, shown expanded in the Linked Servers node. You may need to right-click and refresh the Linked Servers node to see the new linked server. As you can see in the figure, the user tables are not displayed.

Running queries and reviewing results

Running system stored procedures can provide various levels of information, and the database can be queried using the four-part syntax or the openquery() function.

Information on the linked server

It is easy to find how the linked server is configured by using the system stored procedure sp_linkedservers on SQL Server 2008. Open a query window from File | New | Query with Current Connection and type in the following statement; the next figure shows the statement as well as the returned values. SQL Server 2008 query windows have IntelliSense support, and this should be put to good use.

Exec sp_linkedservers

This shows all servers, both local and remote, as shown in the next figure. Information about the tables on the remote server can also be obtained by running a stored procedure: executing sp_tables_ex, as shown in the next figure (which displays both the statement and its result), returns the table information.

Querying the table on the database

Data in the table on the linked server can be queried using the openquery() function. The syntax for this function, shown next, is very simple.

openquery('linked server', 'query')

The next figure shows the result of running the openquery() function on the LINKED_MYSQL linked server. Although it should be possible to query the linked server using the four-part syntax, as in:

Select * from LINKED_MYSQL...employees

the above statement returns an error. This is probably a limitation of the combination of MSDASQL and the ODBC driver, which does not provide the schema information correctly (this is just the author's opinion).

Are Remote Procedure Calls (RPC) allowed?

The easiest way to test this is to send out a call by running the following query against the linked server.

Execute('Select FirstName, LastName from employees') at LINKED_MYSQL

If the linked server is not configured for RPC, then the result you get by running the above query is as shown in the next figure.

Turn on RPC

Earlier on we skipped the Server Options page of the linked server. Back in Management Studio, right-click the linked server LINKED_MYSQL and, from the drop-down menu, choose Properties at the bottom of the list. This brings up the LINKED_MYSQL properties window, as shown. Click on Server Options and, in the Server Options page, change the values of RPC and RPC Out to True (the default for both is False). Now run the query that produced the error previously; the result is displayed in the next figure. You might have noted that only two columns were returned from the employees table. This was deliberate, as trying to get all the columns would produce an error, due partly to the data types of the data stored in the table and their compatibility with MSDASQL and the ODBC driver (again, the author's opinion).

Creating the linked server using TSQL

While the linked server can be created using the built-in wizard of Management Studio, it can also be created using TSQL statements, as in the following listing (run both statements; the first creates the linked server and the second the login). A scripted sketch of the pass-through query and the RPC settings discussed above appears at the end of this article.
Listing 2:

Exec master.dbo.sp_addlinkedserver
     @server=N'MySQlbyCode',
     @srvproduct=N'MySQL',
     @provider=N'MSDASQL',
     @datasrc=N'MySQL_Link'

Exec master.dbo.sp_addlinkedsrvlogin
     @rmtsrvname=N'MySQlbyCode',
     @useself=N'FALSE',
     @locallogin=NULL,
     @rmtuser=N'root',
     @rmtpassword=N'<your password>'

Summary

The article described the steps involved in configuring a MySQL linked server on SQL Server 2008, using both the built-in New Linked Server wizard and TSQL. Methods to query the linked server, as well as enabling RPC, were also described.

If you have read this article you may be interested in the following:

MySQL Data Transfer using SQL Server Integration Services (SSIS)
Transferring Data from MS Access 2003 to SQL Server 2008
Exporting data from MS Access 2003 to MySQL
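For completeness, the pass-through query and the RPC settings described earlier can also be scripted rather than set through the wizard. This is a minimal sketch, assuming the wizard-created linked server is still named LINKED_MYSQL; sp_serveroption is the documented procedure for setting these server options.

-- Pass-through query via OPENQUERY (works even before RPC is enabled).
SELECT *
FROM OPENQUERY(LINKED_MYSQL, 'SELECT FirstName, LastName FROM employees');

-- Scripted equivalent of setting RPC and RPC Out to True on the Server Options page.
EXEC master.dbo.sp_serveroption @server = N'LINKED_MYSQL', @optname = N'rpc',     @optvalue = N'true';
EXEC master.dbo.sp_serveroption @server = N'LINKED_MYSQL', @optname = N'rpc out', @optvalue = N'true';

-- With RPC enabled, the EXECUTE ... AT form shown earlier succeeds as well.
EXEC ('SELECT FirstName, LastName FROM employees') AT LINKED_MYSQL;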

MySQL Data Transfer using SQL Server Integration Services (SSIS)

Packt Editorial Staff
12 Aug 2009
8 min read
There are a large number of posts on the various difficulties experienced while transferring data between MySQL and SQL Server using Microsoft SQL Server Integration Services. While the transfer of data from MySQL to Microsoft SQL Server 2008 is not fraught with any blocking issues, the transfer of data from SQL Server 2008 to MySQL has presented various problems, and some workarounds have been suggested. In this article by Dr. Jay Krishnaswamy, data transfer to MySQL using SQL Server Integration Services will be described. If you are new to SQL Server Integration Services (SSIS) you may want to read the book by the same author, Beginners Guide to SQL Server Integration Services Using Visual Studio 2005, published by Packt.

Connectivity with MySQL

For data interchange with MySQL there are two options, one of which can be accessed in the connection wizards of SQL Server Integration Services, assuming you have installed the programs. The other can be used to set up an ODBC DSN, as described further down. The two connection options are:

MySQL Connector/ODBC 5.1
Connector/Net 5.2 (newer versions 6.0 and 6.1 are also available)

In this article we will be using the ODBC connector for MySQL, which can be downloaded from the MySQL site. The connector will be used to create an ODBC DSN.

Transferring a table from SQL Server 2008 to MySQL

We will transfer a table in the TestNorthwind database on SQL Server 2008 (Enterprise & Evaluation) to a MySQL server database. The MySQL database we are using is described in the article on Exporting data from MS Access 2003 to MySQL. In another article, MySQL Linked Server on SQL Server 2008, creating an ODBC DSN for MySQL was described; we will be using the DSN created in that article.

Creating an Integration Services project in Visual Studio 2008

Start the Visual Studio 2008 program from its shortcut. Click File | New | Project... to open the New Project window, select an Integration Services template from the Business Intelligence projects, and provide a suitable name. The project folder will have a file called Package.dtsx, which can be renamed with a custom name.

Add and configure an ADO.NET Source

The project's package designer will be open, displaying the Control Flow tab. Drag and drop a Data Flow Task onto the Control Flow tabbed page. Next, click on the Data Flow tab in the designer to display the Data Flow page and read the instructions on this page. Drag and drop an ADO.NET Source from the Data Flow Sources items in the Toolbox. It is assumed that you can set up a connection manager to the resident SQL Server 2008 on your machine; the next figure shows the configured connection manager to SQL Server 2008. The table that will be transferred, PrincetonTemp, is in the TestNorthwind database. The authentication is Windows, and a .NET provider is used to access the data. You may also test the connection by clicking the Test Connection button; if the connection is correctly configured, the test should indicate a successful connection.

Right-click the ADO.NET Source and, from the drop-down menu, click Edit. The ADO.NET Source Editor gets displayed. As mentioned earlier, you should be able to access the table and view objects on the database, as shown in the next figure. We have chosen to transfer a simple table, PrincetonTemp, from the TestNorthwind database on SQL Server 2008. It has only a few columns, as shown in the Columns page of the ADO.NET Source Editor. The default for the Error page setting has been assumed, that is, if there is an error or truncation of data the task will fail.
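If you want to follow along without the original TestNorthwind data, a stand-in source table is easy to set up. This is a hypothetical sketch only: PrincetonTemp's actual definition is not given in the article, so the column names and types below are inferred from the MySQL destination table (From2k8) created later, and the sample row mirrors the values visible in the error message further down.

-- Hypothetical stand-in for the PrincetonTemp source table (definition inferred,
-- not taken from the original article); adjust names and types to your own data.
CREATE TABLE dbo.PrincetonTemp (
    Id          INT,
    [Month]     VARCHAR(10),
    Temperature FLOAT,   -- corresponds to DOUBLE PRECISION on the MySQL side
    RecordHigh  FLOAT
);

-- One illustrative row; 'Jan ' and 4.0 appear in the article's error output.
-- RecordHigh is left NULL because its real value is not shown anywhere.
INSERT INTO dbo.PrincetonTemp (Id, [Month], Temperature, RecordHigh)
VALUES (1, 'Jan ', 4.0, NULL);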
Add an ADO.NET destination and port the data from the source

Drag and drop an ADO.NET Destination item from under the Data Flow Destinations items in the Toolbox onto the Data Flow page of the designer. There are two ways to arrange for the data to flow from the source to the destination; the easy way is to just drag the green dangling line from the source with your mouse and let go on the ADO.NET Destination. A solid line will connect the source and the destination, as shown.

Configure a connection manager to connect to MySQL

In the Connection Managers pane under the package designer, right-click to display a pop-up menu which allows you to make a new connection. When you agree to make a new ADO.NET connection, the Configure ADO.NET Connection Manager window shows up; click on the New... button on this page. The connection manager's page gets displayed, as shown. In the Providers drop-down you will see a number of providers. There are two providers that you can use, ODBC through the connector and the MySQL Data Provider. Click on the Odbc Data Provider. As mentioned previously, we will be using the System DSN MySQL_Link, created earlier for the other article, shown in the drop-down list of available ODBC DSNs. Provide the user ID and password, and click the Test Connection button. If all the information is correct you should get a success message, as shown. Close out of the message as well as the Configure ADO.NET Connection Manager windows.

Right-click the ADO.NET Destination to display its editor window. In the drop-down for the connection manager you should be able to pick the connection manager you created in the previous step (MySQL_Link.root), as shown. Click on the New... button to create a Table or View. You will get a warning message about the mapping to SSIS not being known, as shown; click OK. The create table window gets displayed, as shown. Notice that the table displays all the columns that the source is sending out. If you were to click OK, you would get an error that the syntax is not correct, as shown. Modify the table as shown next to change the destination table name (your choice) and the data types.

CREATE TABLE From2k8(
    "Id" INT,
    "Month" VARCHAR(10),
    "Temperature" DOUBLE PRECISION,
    "RecordHigh" DOUBLE PRECISION
)

Click OK. Again you get the same error regarding the syntax not being correct. Modify the CREATE TABLE statement further, as shown.

CREATE TABLE From2k8 (
    Id INT,
    Month VARCHAR(10),
    Temperature DOUBLE PRECISION,
    RecordHigh DOUBLE PRECISION
)

Click OK after the above modification. The table gets added to the ADO.NET Destination Editor, as shown. Click on Mappings on the left side of the ADO.NET Destination Editor; the column mappings page gets displayed, as shown. We accept the default settings for the Error Output page and click OK.

Build the project and execute the package by right-clicking the package and choosing Execute Package. The program runs and processes the package, but ends up being unsuccessful, with the error message in the Progress tab of the project as shown (only the relevant messages are reproduced here).

....
.....
[SSIS.Pipeline] Information: Execute phase is beginning.
[ADO NET Destination 1 [165]] Error: An exception has occurred during data insertion, the message returned from the provider is: ERROR [42000] [MySQL][ODBC 5.1 Driver][mysqld-5.1.30-community]You have an error in your SQL syntax; check the manual that corresponds to your MySQL server version for the right syntax to use near '"Id", "Month", "Temperature", "RecordHigh") VALUES (1, 'Jan ', 4.000000000' at line 1

[SSIS.Pipeline] Error: SSIS Error Code DTS_E_PROCESSINPUTFAILED. The ProcessInput method on component "ADO NET Destination 1" (165) failed with error code 0xC020844B while processing input "ADO NET Destination Input" (168). The identified component returned an error from the ProcessInput method. The error is specific to the component, but the error is fatal and will cause the Data Flow task to stop running. There may be error messages posted before this with more information about the failure.

[SSIS.Pipeline] Information: Post Execute phase is beginning.
......
....
Task Data Flow Task failed
....

Start the MySQL server and log in to it. Run the statements shown in the next figure (a hedged sketch of them also appears after this article's resource list). Setting the mode to 'ANSI' makes MySQL accept more standard syntax, since MySQL can cater to clients using different SQL modes; in the default mode the double-quoted identifiers that SSIS generates are not treated as column names, which is why the above error is returned even though the syntax itself appears correct. In fact, a CREATE statement that runs without problems directly on the MySQL command line can still return an error when SSIS is used to create the same table, for this very reason.

After running the above statements, build the BI project and execute the package. This time the execution will be successful, and you can query the MySQL server as in the following:

Summary

The article describes, step by step, transferring a table from SQL Server 2008 to MySQL using ODBC connectivity. For a successful transfer, the data type differences between SQL Server 2008 and the MySQL version in use must be taken into consideration, and the SQL_MODE property of the MySQL server must be set correctly.

Further resources on this subject:

Easy guide to understand WCF in Visual Studio 2008 SP1 and Visual Studio 2010 Express
MySQL Linked Server on SQL Server 2008
Displaying MySQL data on an ASP.NET Web Page
Exporting data from MS Access 2003 to MySQL
Transferring Data from MS Access 2003 to SQL Server 2008
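As referenced above, the mode-change statements appear only in a figure in the original article, so the following is a minimal sketch of what they would look like; ANSI mode includes ANSI_QUOTES, which makes MySQL treat the double-quoted identifiers emitted by SSIS as column names rather than string literals.

-- Hedged sketch: switch the server (or just the current session) to ANSI mode.
-- SET GLOBAL requires a privileged account such as root.
SET GLOBAL sql_mode = 'ANSI';
SET SESSION sql_mode = 'ANSI';

-- Verify the change:
SELECT @@GLOBAL.sql_mode, @@SESSION.sql_mode;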

Creating External Tables in your Oracle 10g/11g Database

Packt Editorial Staff
07 Jun 2009
16 min read
In this two-part article by Hector R. Madrid, we will learn about External Tables in the Oracle 10g/11g database. When working in data warehouse environments, the Extraction-Transformation-Loading (ETL) cycle frequently requires the user to load information from external sources in plain file format, or to perform data transfers among Oracle databases in a proprietary format. This requires the user to create control files to perform the load. As the format of the source data regularly doesn't fit the one required by the data warehouse, a common practice is to create stage tables that load data into the database, and then several queries that perform the transformation from this point on, to take the data to its final destination.

A better approach is to perform this transformation 'on the fly' at load time. That is what External Tables are for. They are basically external files, managed either by means of the SQL*Loader or the data pump driver, which from the database perspective can be seen as if they were regular read-only tables. This format allows the user to think about the data source as if the data was already loaded into its stage table, and lets the user concentrate on the queries that perform the transformation, thus saving precious time during the load phase.

The basics of External Tables in Oracle 10g/11g

An External Table is basically a file that resides on the server side, as a regular flat file or as a data pump formatted file. The External Table is not a table itself; it is an external file with an Oracle format and a physical location. This feature first appeared back in Oracle 9i Release 1, and it was intended as a way of enhancing the ETL process by reading an external flat file as if it were a regular Oracle table. On its initial release it was only possible to create read-only External Tables, but starting with 10g it is possible to unload data to External Tables too.

Prior to 10g, only the SQL*Loader driver could be used to read an External Table; from 10g onwards it is also possible to load the table by means of the data pump driver. The kind of driver that will be used to read the External Table is defined at creation time. In the case of ORACLE_LOADER it is the same driver used by SQL*Loader: the flat files are read in the same way that a flat file is loaded into the database by means of the SQL*Loader utility, and the creation script can be built from a SQL*Loader control file. In fact, most of the keywords that are valid for data loading are also valid for reading an external flat file table.

The main differences between SQL*Loader and External Tables are:

When there are several input datafiles, SQL*Loader will generate a bad file and a discard file for each datafile.
The CONTINUEIF and CONCATENATE keywords are not supported by External Tables.
The GRAPHIC, GRAPHIC EXTERNAL, and VARGRAPHIC types are not supported for External Tables.
LONG, nested tables, VARRAY, REF, primary key REF, and SID are not supported.
For fields in External Tables the character set, decimal separator, date mask and other locale settings are determined by the database NLS settings.
The use of the backslash character is allowed for SQL*Loader, but for External Tables this would raise an error.
External Tables must use quotation marks instead. For example:

SQL*Loader:
FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY ""

External Tables:
TERMINATED BY ',' ENCLOSED BY "'"

A second driver is available, the ORACLE_DATAPUMP access driver, which uses the Data Pump technology to read the table and unload data to an External Table. This driver allows the user to perform a logical backup that can later be read back into the database without actually loading the data. The ORACLE_DATAPUMP access driver utilizes a proprietary binary format for the external file, so it is not possible to view it as a flat file.

Let's set up the environment

Let's create the demonstration user and prepare its environment to create an External Table. The example developed first refers to an External Table using the ORACLE_LOADER driver.

create user EXTTABDEMO
   identified by ORACLE
   default tablespace USERS;

alter user exttabdemo
   quota unlimited on users;

grant CREATE SESSION,
      CREATE TABLE,
      CREATE PROCEDURE,
      CREATE MATERIALIZED VIEW,
      ALTER SESSION,
      CREATE VIEW,
      CREATE ANY DIRECTORY
to EXTTABDEMO;

A simple formatted spool from the following query will generate the required External Table demonstration data. The original source table is the demonstration HR.EMPLOYEES table.

select EMPLOYEE_ID    ||','||
       DEPARTMENT_ID  ||','||
       FIRST_NAME     ||','||
       LAST_NAME      ||','||
       PHONE_NUMBER   ||','||
       HIRE_DATE      ||','||
       JOB_ID         ||','||
       SALARY         ||','||
       COMMISSION_PCT ||','||
       MANAGER_ID     ||','||
       EMAIL
from HR.EMPLOYEES
order by EMPLOYEE_ID;

The above query will produce the following sample data:

The External Table directory is defined inside the database by means of a DIRECTORY object. This object is not validated at creation time, so the user must make sure the physical directory exists and the oracle OS user has read/write privileges on it.

$ mkdir $HOME/external_table_dest

SQL> CREATE DIRECTORY EXTTABDIR AS '/home/oracle/external_table_dest';

The above example was developed in a Linux environment; on Windows platforms the paths will need to be changed to correctly reflect how Oracle has been set up. Now, the first External Table can be created.

A basic External Table

The source code of the External Table creation is shown in the figure (a sketch of it is reconstructed below). The CREATE TABLE command syntax is just like any other regular table creation (A), (B), up to the point where the ORGANIZATION EXTERNAL (C) keyword appears; this is the point where the actual External Table definition starts. In this case the External Table is accessed by the ORACLE_LOADER driver (D). Next, the external flat file is defined, and here the Oracle DIRECTORY (E) where the flat file resides is declared. The ACCESS PARAMETERS (F) section specifies how to access the flat file; it declares whether the file has fixed or variable size records, along with any other SQL*Loader options that apply. The LOCATION (H) keyword defines the name of the external data file. It must be pointed out that, as this is an External Table managed by the SQL*Loader driver, the ACCESS PARAMETERS section must be defined; in the case of External Tables based on the data pump driver this section is not required.

The columns are defined only by name (G), not by type. This is permitted from the SQL*Loader perspective, and allows for dynamic column definition. This column schema definition is more flexible, but it has a drawback: data formats such as those in DATE columns must match the database date format, otherwise the row will be rejected. There are ways to define the date format to work around this requirement, which we return to right after the following sketch.
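The creation script itself appears only as a figure in the original article, so the following is a minimal reconstruction based on the description above: it assumes the table is named EXT_EMPLOYEES (the name used in the later examples), that the spooled file is called employees.txt, and that the column order matches the spool query. The field list inside ACCESS PARAMETERS is the part where fields are given by name only.

-- Hedged sketch of the basic ORACLE_LOADER External Table described above.
CREATE TABLE ext_employees (
  employee_id    NUMBER(6),
  department_id  NUMBER(4),
  first_name     VARCHAR2(20),
  last_name      VARCHAR2(25),
  phone_number   VARCHAR2(20),
  hire_date      DATE,           -- relies on the default date format, as discussed above
  job_id         VARCHAR2(10),
  salary         NUMBER(8,2),
  commission_pct NUMBER(2,2),
  manager_id     NUMBER(6),
  email          VARCHAR2(25)
)
ORGANIZATION EXTERNAL (          -- (C) the external definition starts here
  TYPE ORACLE_LOADER             -- (D) access driver
  DEFAULT DIRECTORY EXTTABDIR    -- (E) directory object created earlier
  ACCESS PARAMETERS (
    RECORDS DELIMITED BY NEWLINE
    FIELDS TERMINATED BY ','
    MISSING FIELD VALUES ARE NULL
    (employee_id, department_id, first_name, last_name, phone_number,
     hire_date, job_id, salary, commission_pct, manager_id, email)
  )
  LOCATION ('employees.txt')     -- (H) external data file
)
REJECT LIMIT UNLIMITED;

-- Once created, the flat file can be queried like a regular (read-only) table:
SELECT last_name, salary FROM ext_employees WHERE department_id = 50;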
Assuming the date column changes from its original default format mask "DD-MON-RR" to "DD-MM-RR", the column definition must change from a simple CHAR column to a DATE column with a format mask:

Original format:
"HIRE_DATE" CHAR(255)

Changed format:
"HIRE_DATE" DATE "DD-MM-RR"

When working with an External Table, the access parameters are not validated at creation time, so if there are malformed rows, or if the access parameters are improperly defined, an error similar to the one below is shown:

ERROR at line 1:
ORA-29913: error in executing ODCIEXTTABLEFETCH callout
ORA-30653: reject limit reached
ORA-06512: at "SYS.ORACLE_LOADER", line 52

Once the data is created and all required OS privileges have been properly validated, the data can be seen from inside the database, just as if it were a regular Oracle table. This table is read only, so if the user attempts to perform any DML operation against it, it will result in this error:

SQL> delete ext_employees;
delete ext_employees
       *
ERROR at line 1:
ORA-30657: operation not supported on external organized table

As the error message clearly states, this kind of table is only useful for read-only operations. It doesn't support most of the operations available for regular tables, such as index creation and statistics gathering; these types of operations will cause an ORA-30657 error too. The only access method available for External Tables is full table scan, so there is no way to perform a selective data retrieval operation.

External Tables cannot be recovered; they are just metadata definitions stored in the dictionary tables. The actual data resides in external files, and there is no way to protect them with the regular database backup routines, so it is the user's sole responsibility to provide proper backup and data management procedures. At the database level the only kind of protection the External Table receives is at the metadata level, as it is an object stored as a definition in the database dictionary. As the data resides in the external data file, if it were corrupted, altered, or somehow modified, there would be no way to get back the original data; if the external data file is lost, this may go unnoticed until the next SELECT operation takes place.

The metadata for an External Table is recorded in the {USER | ALL | DBA}_TABLES view, and as this table doesn't actually require physical database storage, all storage-related columns appear as null, as do the columns that relate to statistical information. The table is described by the {USER | ALL | DBA}_EXTERNAL_TABLES view, where information such as the kind of access driver, the reject limit, and the access parameters, amongst others, is recorded.

SQL> DESC USER_EXTERNAL_TABLES
 Name                            Null?    Type
 ------------------------------- -------- --------------
 TABLE_NAME                      NOT NULL VARCHAR2(30)
 TYPE_OWNER                               CHAR(3)
 TYPE_NAME                       NOT NULL VARCHAR2(30)
 DEFAULT_DIRECTORY_OWNER                  CHAR(3)
 DEFAULT_DIRECTORY_NAME          NOT NULL VARCHAR2(30)
 REJECT_LIMIT                             VARCHAR2(40)
 ACCESS_TYPE                              VARCHAR2(7)
 ACCESS_PARAMETERS                        VARCHAR2(4000)
 PROPERTY                                 VARCHAR2(10)

This is the first, basic External Table, and as previously shown, its creation is pretty simple.
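As a quick sanity check of that dictionary metadata, a query such as the following (a sketch using the columns from the DESC output above) shows the access driver, default directory, and reject limit recorded for the demonstration table:

-- List the External Tables owned by the current user, with their key settings.
SELECT table_name,
       type_name,               -- ORACLE_LOADER or ORACLE_DATAPUMP
       default_directory_name,
       reject_limit,
       access_type
FROM   user_external_tables;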
This basic External Table allows external data to be easily accessed from inside the database, letting the user see the external data just as if it were already loaded into a regular stage table.

Creating External Table metadata, the easy way

To further illustrate the tight relationship between SQL*Loader and External Tables, the SQL*Loader tool may be used to generate a script that creates an External Table according to a pre-existing control file. SQL*Loader has a command line option named EXTERNAL_TABLE, which can hold one of three different values: {NOT_USED | GENERATE_ONLY | EXECUTE}. If nothing is set, it defaults to NOT_USED. This keyword is used to generate the script to create an External Table, and the options mean:

NOT_USED: This is the default option; no External Tables are to be used in this load.
GENERATE_ONLY: SQL*Loader will only read the definitions from the control file and generate the required commands, so the user can record them for later execution, or tailor them to fit particular needs.
EXECUTE: This not only generates the External Table scripts, but also executes them. If the user requires a sequence, then the EXECUTE option will not only create the table, but will also create the required sequence, deleting it once the data load is finished. This option performs the data load process against the specified regular target table by means of an External Table; it creates both the directory and the External Table, and inserts the data using an INSERT AS SELECT with the APPEND hint.

Let's use the GENERATE_ONLY option to generate the External Table creation scripts:

$ sqlldr exttabdemo/oracle employees external_table=GENERATE_ONLY

By default the log is written to a file whose extension is .log and whose name equals that of the control file.
By opening it we see, among the rest of the log processing information, this set of DDL commands:

CREATE TABLE "SYS_SQLLDR_X_EXT_EMPLOYEES"
(
  "EMPLOYEE_ID" NUMBER(6),
  "FIRST_NAME" VARCHAR2(20),
  "LAST_NAME" VARCHAR2(25),
  "EMAIL" VARCHAR2(25),
  "PHONE_NUMBER" VARCHAR2(20),
  "HIRE_DATE" DATE,
  "JOB_ID" VARCHAR2(10),
  "SALARY" NUMBER(8,2),
  "COMMISSION_PCT" NUMBER(2,2),
  "MANAGER_ID" NUMBER(6),
  "DEPARTMENT_ID" NUMBER(4)
)
ORGANIZATION external
(
  TYPE oracle_loader
  DEFAULT DIRECTORY EXTTABDIR
  ACCESS PARAMETERS
  (
    RECORDS DELIMITED BY NEWLINE CHARACTERSET US7ASCII
    BADFILE 'EXTTABDIR':'employees.bad'
    LOGFILE 'employees.log_xt'
    READSIZE 1048576
    FIELDS TERMINATED BY "," OPTIONALLY ENCLOSED BY '"' LDRTRIM
    REJECT ROWS WITH ALL NULL FIELDS
    (
      "EMPLOYEE_ID" CHAR(255) TERMINATED BY "," OPTIONALLY ENCLOSED BY '"',
      "FIRST_NAME" CHAR(255) TERMINATED BY "," OPTIONALLY ENCLOSED BY '"',
      "LAST_NAME" CHAR(255) TERMINATED BY "," OPTIONALLY ENCLOSED BY '"',
      "EMAIL" CHAR(255) TERMINATED BY "," OPTIONALLY ENCLOSED BY '"',
      "PHONE_NUMBER" CHAR(255) TERMINATED BY "," OPTIONALLY ENCLOSED BY '"',
      "HIRE_DATE" CHAR(255) TERMINATED BY "," OPTIONALLY ENCLOSED BY '"',
      "JOB_ID" CHAR(255) TERMINATED BY "," OPTIONALLY ENCLOSED BY '"',
      "SALARY" CHAR(255) TERMINATED BY "," OPTIONALLY ENCLOSED BY '"',
      "COMMISSION_PCT" CHAR(255) TERMINATED BY "," OPTIONALLY ENCLOSED BY '"',
      "MANAGER_ID" CHAR(255) TERMINATED BY "," OPTIONALLY ENCLOSED BY '"',
      "DEPARTMENT_ID" CHAR(255) TERMINATED BY "," OPTIONALLY ENCLOSED BY '"'
    )
  )
  location
  (
    'employees.txt'
  )
)

This is the more complete version; some differences from the basic script are:

All the column definitions are set to CHAR(255), with the delimiter character defined for each column.
If the current working directory is already registered as a regular DIRECTORY at the database level, SQL*Loader utilizes it; otherwise, it creates a new directory definition.
The script specifies where the bad file and log file are located.
It specifies that a record with all-null columns is rejected.

In the case of the EXECUTE keyword, the log file shows that the scripts are used not only to create the External Table, but also to execute the INSERT statement with the /*+ APPEND */ hint. The load is performed in direct path mode.

All External Tables, when accessed, generate a log file. In the case of the ORACLE_LOADER driver, this file is similar to the file generated by SQL*Loader; it has a different format in the case of the ORACLE_DATAPUMP driver. The log file is generated in the same location where the external file resides, and its name has the following format:

<EXTERNAL_TABLE_NAME>_<OraclePID>.log

When an ORACLE_LOADER managed External Table has errors, it dumps the 'bad' rows to the *.bad file, just the same as if the data were being loaded by SQL*Loader. The ORACLE_DATAPUMP External Table generates a simpler log file; it only contains the time stamp when the External Table was accessed, and it creates a log file for each Oracle process accessing the External Table.
Unloading data to External Tables

The driver used to unload data to an External Table is the ORACLE_DATAPUMP access driver. It dumps the contents of a table into a binary proprietary format file. This way you can exchange data with other 10g and higher databases in a preformatted way that meets the other database's requirements. Unloading data to an External Table doesn't make it updatable; the tables are still limited to being read only.

Let's unload the EMPLOYEES table to an External Table:

create table dp_employees
organization external(
  type oracle_datapump
  default directory EXTTABDIR
  location ('dp_employees.dmp')
)
as select * from employees;

This creates a table named DP_EMPLOYEES, located in the specified EXTTABDIR directory and with a defined OS file name. In the next example, at a different database, a new DP_EMPLOYEES table is created; this table uses the data already unloaded by the first database. This DP_EMPLOYEES External Table is created on the 11g database side.

create table dp_employees(
  EMPLOYEE_ID NUMBER(6),
  FIRST_NAME VARCHAR2(20),
  LAST_NAME VARCHAR2(25),
  EMAIL VARCHAR2(25),
  PHONE_NUMBER VARCHAR2(20),
  HIRE_DATE DATE,
  JOB_ID VARCHAR2(10),
  SALARY NUMBER(8,2),
  COMMISSION_PCT NUMBER(2,2),
  MANAGER_ID NUMBER(6),
  DEPARTMENT_ID NUMBER(4)
)
organization external (
  type oracle_datapump
  default directory EXTTABDIR
  location ('dp_employees.dmp')
);

This table can read in the data unloaded by the first database. The second database is a regular 11g database, so this shows the inter-version upward compatibility between a 10g and an 11g database.

SQL> select count(*) from dp_employees;

  COUNT(*)
----------
       107

Inter-version compatibility

In the previous example a 10g data pump generated an External Table that was transparently read by the 11g release. Let's create an 11g data pump External Table named DP_DEPARTMENTS:

create table dp_departments
organization external(
  type oracle_datapump
  default directory EXTTABDIR
  access parameters (
    version '10.2.0'
  )
  location ('dp_departments.dmp')
)
as select * from departments;

Table created.

SQL> select count(*) from dp_departments;

  COUNT(*)
----------
        27

In the previous example, it is important to point out that the VERSION keyword defines the compatibility format:

access parameters (
  version '10.2.0'
)

If this clause is not specified, then an incompatibility error will be displayed when the older release tries to read the file:

SQL> select count(*) from dp_departments;
select count(*) from dp_departments
*
ERROR at line 1:
ORA-29913: error in executing ODCIEXTTABLEOPEN callout
ORA-39142: incompatible version number 2.1 in dump file
"/home/oracle/external_table_dest/dp_departments.dmp"
ORA-06512: at "SYS.ORACLE_DATAPUMP", line 19

Now let's use the 10g version to read from it:

SQL> select count(*) from dp_departments;

  COUNT(*)
----------
        27

The VERSION clause is interpreted the same way as the VERSION clause for a data pump export; it has three different values:

COMPATIBLE: The version of the metadata corresponds to the database compatibility level.
LATEST: This corresponds to the database version.
VERSION NUMBER: This refers to a specific Oracle version that the file is compatible with. This value cannot be lower than 9.2.0.

Summary

As we can see, External Tables can serve not only as improvements to the ETL process, but also as a means to manage database environments, and a means of reducing the complexity level of data management from the user's point of view.
In the next part, we will see how External Tables can be used for data transformation and the enhancements Oracle 11g has brought about in External Tables.