Data | Tech News, Tutorials & Expert Insights

article-image-getting-started-oracle-information-integration

02 Aug 2011

14 min read

Getting Started with Oracle Information Integration

02 Aug 2011

Oracle Information Integration, Migration, and Consolidation: RAW The definitive book and eBook guide to Oracle Information Integration and Migration in a heterogeneous world Read more about this book (For more resources on Oracle, see here.) Why consider information integration? The useful life of pre-relational mainframe database management system engines is coming to an end because of a diminishing application and skills base, and increasing costs.—Gartner Group During the last 30 years, many companies have deployed mission critical applications running various aspects of their business on the legacy systems. Most of these environments have been built around a proprietary database management system running on the mainframe. According to Gartner Group, the installed base of mainframe, Sybase, and some open source databases has been shrinking. There is vendor sponsored market research that shows mainframe database management systems are growing, which, according to Gartner, is due primarily to increased prices from the vendors, currency conversions, and mainframe CPU replacements. Over the last few years, many companies have been migrating mission critical applications off the mainframe onto open standard Relational Database Management Systems (RDBMS) such as Oracle for the following reasons: Reducing skill base: Students and new entrants to the job market are being trained on RDBMS like Oracle and not on the legacy database management systems. Legacy personnel are retiring, and those that are not are moving into expensive consulting positions to arbitrage the demand. Lack of flexibility to meet business requirements: The world of business is constantly changing and new business requirements like compliance and outsourcing require application changes. Changing the behavior, structure, access, interface or size of old databases is very hard and often not possible, limiting the ability of the IT department to meet the needs of the business. Most applications on the aging platforms are 10 to 30 years old and are long past their original usable lifetime. Lack of Independent Software Vendor (ISV)applications: With most ISVs focusing on the larger market, it is very difficult to find applications, infrastructure, and tools for legacy platforms. This requires every application to be custom coded on the closed environment by scarce in-house experts or by expensive outside consultants. Total Cost of Ownership (TCO): As the user base for proprietary systems decreases, hardware, spare parts, and vendor support costs have been increasing. Adding to this are the high costs of changing legacy applications, paid either as consulting fees for a replacement for diminishing numbers of mainframe trained experts or increased salaries for existing personnel. All leading to a very high TCO which doesn't even take into account the opportunity cost to the business of having inflexible systems. Business challenges in data integration and migration Once the decision has been taken to migrate away from a legacy environment, the primary business challenge is business continuity. Since many of these applications are mission critical, running various aspects of the business, the migration strategy has to ensure continuity to the new application—and in the event of failure, rollback to the mainframe application. This approach requires data in the existing application to be synchronized with data on the new application. Making the challenge of data migration more complicated is the fact that legacy applications tend to be interdependent, but the need from a risk mitigation standpoint is to move applications one at a time. A follow-on challenge is prioritizing the order in which applications are to be moved off the mainframe, and ensuring that the order meets both the business needs and minimizes the risk in the migration process. Once a specific application is being migrated, the next challenge is to decide which business processes will be migrated to the new application. Many companies have business processes that are present, because that's the way their systems work. When migrating an application off the mainframe, many business processes do not need to migrate. Even among the business processes that need to be migrated, some of these business processes will need to be moved as-is and some of them will have to be changed. Many companies utilize the opportunity afforded by a migration to redo the business processes they have had to live with for many years. Data is the foundation of the modernization process. You can move the application, business logic, and work flow, but without a clean migration of the data the business requirements will not be met. A clean data migration involves: Data that is organized in a usable format by all modern tools Data that is optimized for an Oracle database Data that is easy to maintain Technical challenges of information integration The technical challenges with any information integration all stem from the fact that the application accesses heterogeneous data (VSAM, IMS, IDMS, ADABAS, DB2, MSSQL, and so on) that can even be in a non-relational hierarchical format. Some of the technical problems include: The flexible file definition feature used in COBOL applications in the existing system will have data files with multi-record formats and multi-record types in the same dataset—neither of which exist in RDBMS. Looping data structure and substructure or relative offset record organization such as a linked list, which are difficult to map into a relational table. Data and referential integrity is managed by the Oracle database engine. However, legacy applications already have this integrity built in. One question is whether to use Oracle to handle this integrity and remove the logic from the application. Finally, creating an Oracle schema to maximize performance, which includes mapping non-oracle keys to Oracle primary and secondary keys; especially when legacy data is organized in order of key value which can affect the performance on an Oracle RDBMS. There are also differences in how some engines process transactions, rollbacks, and record locking. General approaches to information integration and migration There are several technical approaches to consider when doing any kind of integration or migration activity. In this section, we will look at a methodology or approach for both data integration and data migration. Data integration Clearly, given this range of requirements, there are a variety of different integration strategies, including the following: Consolidated: A consolidated data integration solution moves all data into a single database and manages it in a central location. There are some considerations that need to be known regarding the differences between non- Oracle and Oracle mechanics. Transaction processing is an example. Some engines use implicit commits and some manage character sets differently than Oracle does, this has an impact on sort order. Federated: A federated data integration solution leaves data in the individual data source where it is normally maintained and updated, and simply consolidates it on the fly as needed. In this case, multiple data sources will appear to be integrated into a single virtual database, masking the number and different kinds of databases behind the consolidated view. These solutions can work bidirectionally. Shared: A shared data integration solution actually moves data and events from one or more source databases to a consolidated resource, or queue, created to serve one or more new applications. Data can be maintained and exchanged using technologies such as replication, message queuing, transportable table spaces, and FTP. Oracle has extensive support for consolidated data integration and while there are many obvious benefits to the consolidated solution, it is not practical for any organization that must deal with legacy systems or integrate with data it does not own. Therefore, we will not discuss this type any further, but instead concentrate on federated and shared solutions. Data migration Over 80 percent of migration projects fail or overrun their original budgets/ timelines, according to a study by the Standish Group. In most cases, this is because of a lack of understanding of some of the unique challenges of a migration project. The top five challenges of a migration project are: Little migration expertise to draw from: Migration is not an industry-recognized area of expertise with an established body of knowledge and practices, nor have most companies built up any internal competency to draw from. Insufficient understanding of data and source systems: The required data is spread across multiple source systems, not in the right format, of poor quality, only accessible through vaguely understood interfaces, and sometimes missing altogether. Continuously evolving target system: The target system is often under development at the time of data migration, and the requirements often change during the project. Complex target data validations: Many target systems have restrictions, constraints, and thresholds on the validity, integrity, and quality of the data to be loaded. Repeated synchronization after the initial migration: Migration is not a one-time effort. Old systems are usually kept alive after new systems launch and synchronization is required between the old and new systems during this handoff period. Also, long after the migration is completed, companies often have to prove the migration was complete and accurate to various government, judicial, and regulatory bodies. Most migration projects fail because of an inappropriate migration methodology, because the migration problem is thought of as a four stage process: Analyze the source data Extract/transform the data into the target formats Validate and cleanse the data Load the data into the target However, because of the migration challenges discussed previously, this four stage project methodology often fails miserably. The challenge begins during the initial analysis of the source data when most of the assumptions about the data are proved wrong. Since there is never enough time planned for analysis, any mapping specification from the mainframe to Oracle is effectively an intelligent guess. Based on the initial mapping specification, extractions, and transformations developed run into changing target data requirements, requiring additional analysis and changes to the mapping specification. Validating the data according to various integrity and quality constraints will typically pose a challenge. If the validation fails, the project goes back to further analysis and then further rounds of extractions and transformations. When the data is finally ready to be loaded into Oracle, unexpected data scenarios will often break the loading process and send the project back for more analysis, more extractions and transformations, and more validations. Approaching migration as a four stage process means continually going back to earlier stages due to the five challenges of data migration. The biggest problem with migration project methodology is that it does not support the iterative nature of migrations. Further complicating the issue is that the technology used for data migration often consists of general-purpose tools repurposed for each of the four project stages. These tools are usually non-integrated and only serve to make difficult processes more difficult on top of a poor methodology. The ideal model for successfully managing a data migration project is not based on multiple independent tools. Thus, a cohesive method enables you to cycle or spiral your way through the migration process—analyzing the data, extracting and transforming the data, validating the data, and loading it into targets, and repeating the same process until the migration is successfully completed. This approach enables target-driven analysis, validating assumptions, refining designs, and applying best practices as the project progresses. This agile methodology uses the same four stages of analyze, extract/transform, validate and load. However, the four stages are not only iterated, but also interconnected with one another. An iterative approach is best achieved through a unified toolset, or platform, that leverages automation and provides functionality which spans all four stages. In an iterative process, there is a big difference between using a different tool for each stage and one unified toolset across all four stages. In one unified toolset, the results of one stage can be easily carried into the next, enabling faster, more frequent and ultimately less iteration which is the key to success in a migration project. A single platform not only unifies the development team across the project phases, but also unifies the separate teams that may be handling each different source system in a multi-source migration project. Architectures: federated versus shared Federated data integration can be very complicated. This is especially the case for distributed environments where several heterogeneous remote databases are to be synchronized using two-phase commit. Solutions that provide federated data integration access and maintain the data in the place wherever it resides (such as in a mainframe data store associated with legacy applications). Data access is done 'transparently' for example, the user (or application) interacts with a single virtual or federated relational database under the control of the primary RDBMS, such as Oracle. This data integration software is working with the primary RDBMS 'under the covers' to transform and translate schemas, data dictionaries, and dialects of SQL; ensure transactional consistency across remote foreign databases (using two-phase commit); and make the collection of disparate, heterogeneous, distributed data sources appear as one unified database. The integration software carrying out these complex tasks needs to be tightly integrated with the primary RDBMS in order to benefit from built-in functions and effective query optimization. The RDBMS must also provide all the other important RDBMS functions, including effective query optimization. Data sharing integration Data sharing-based integration involves the sharing of data, transactions, and events among various applications in an organization. It can be accomplished within seconds or overnight, depending on the requirement. It may be done in incremental steps, over time, as individual one-off implementations are required. If one-off tools are used to implement data sharing, eventually the variety of data-sharing approaches employed begin to conflict, and the IT department becomes overwhelmed with an unmanageable maintenance, which increases the total cost of ownership. What is needed is a comprehensive, unified approach that relies on a standard set of services to capture, stage, and consume the information being shared. Such an environment needs to include a rules-based engine, support for popular development languages, and comply with open standards. GUI-based tools should be available for ease of development and the inherent capabilities should be modular enough to satisfy a wide variety of possible implementation scenarios. The data-sharing form of data integration can be applied to achieve near real-time data sharing. While it does not guarantee the level of synchronization inherent with a federated data integration approach (for example, if updates are performed using two-phase commit), it also doesn't incur the corresponding performance overhead. Availability is improved because there are multiple copies of the data. Considerations when choosing an integration approach There is a range in the complexity of data integration projects from relatively straightforward (for example, integrating data from two merging companies that used the same Oracle applications) to extremely complex projects such as long-range geographical data replication and multiple database platforms. For each project, the following factors can be assessed to estimate the complexity level. Pretend you are a systems integrator such as EDS trying to size a data integration effort as you prepare a project proposal. Potential for conflicts: Is the data source updated by more than one application? If so, the potential exists for each application to simultaneously update the same data. Latency: What is the required synchronization level for the data integration process? Can it be an overnight batch operation like a typical data warehouse? Must it be synchronous, and with two-phase commit? Or, can it be quasi-real-time, where a two or three second lag is tolerable, permitting an asynchronous solution? Transaction volumes and data growth trajectory: What are the expected average and peak transaction rates and data processing throughput that will be required? Access patterns: How frequently is the data accessed and from where? Data source size: Some data sources of such volume that back up, and unavailability becomes extremely important. Application and data source variety: Are we trying to integrate two ostensibly similar databases following the merger of two companies that both use the same application, or did they each have different applications? Are there multiple data sources that are all relational databases? Or are we integrating data from legacy system files with relational databases and realtime external data feeds? Data quality: The probability that data quality adds to overall project complexity increases as the variety of data sources increases. One point of this discussion is that the requirements of data integration projects will vary widely. Therefore, the platform used to address these issues must be a rich superset of the features and functions that will be applied to any one project.

0
0
2387

Packt

01 Aug 2011

8 min read

Oracle Tools and Products

Packt

01 Aug 2011

8 min read

0
0
7298

article-image-integrating-kettle-and-pentaho-suite

Packt

14 Jul 2011

13 min read

Integrating Kettle and the Pentaho Suite

Packt

14 Jul 2011

13 min read

0
0
6683

article-image-excel-2010-financials-identifying-profitability-investment

Packt

12 Jul 2011

5 min read

Excel 2010 Financials: Identifying the Profitability of an Investment

Packt

12 Jul 2011

5 min read

Excel 2010 Financials Cookbook Powerful techniques for financial organization, analysis, and presentation in Microsoft Excel Read more about this book (For more resources on this subject, see here.) Calculating the depreciation of assets Assets within a business are extremely important for a number of reasons. Assets can become investments for growth or investments in another line of business. Assets can also take on many forms such as computer equipment, vehicles, furniture, buildings, land, and so on. Assets are not only important within the business for which they are used, but they are also used as a method of reducing the tax burden on a business. As a financial manager, you are tasked with calculating the depreciation expense for a laptop computer with a useful life of five years. In this recipe, you will learn to calculate the depreciation of an asset over the life of the asset. Getting ready There are several different methods of depreciation. A business may use straight-line depreciation, declining depreciation, double-declining depreciation, or a variation of these methods. Excel has the functionality to calculate each of the methods with a slight variation to the function; however, in this recipe, we will use a straight-line depreciation. Straight-line depreciation provides equal reduction of an asset over its life. Getting ready We will first need to set up the Excel worksheet to hold the depreciation values for the laptop computer: In cell A5 list Year and in cell B5 list 1. This will account for the depreciation for the year that the asset was purchased. Continue this list until all five years are listed: In cell B2, list the purchase price of the Laptop computer; the purchase price is $2500: In cell B4, we will need to enter the salvage value of the asset. The salvage value will be the estimated resale value of the asset when it is useful life, as determined by generally accepted accounting principles, has elapsed. Enter $500 in cell B3: In cell B5 enter the formula =SLN($B$2,$B$3,5) and press Enter: Copy the formula from cell C5 and paste it through cell C9: Excel now has listed the straight-line depreciation expense for each of the five years. As you can see in this schedule, the depreciation expense remains consistent through each year of the asset's useful life. How it works... Straight-line depreciation calculates the value of the purchase price minus the salvage price, and divides the remainder across the useful life. The function accepts inputs as follows =SLN(purchase price, salvage price, useful life). There's more... Other depreciation methods are as follows: Calculating the future versus current value of your money When working within finance, accounting, or general business it is important to know how much money you have. However, knowing how much money you have now is only a portion of the whole financial picture. You must also know how much your money will be worth in the future. Knowing future value allows you to know truly how much your money is worth, and with this knowledge, you can decide what you need to do with it. As a financial manager, you must provide feedback on whether to introduce a new product line. As with any new venture, there will be several related costs including start-up costs, operational costs, and more. Initially, you must spend $20,000 to account for most start-up costs and you will potentially, for the sake of the example, earn a profit of $5500 for five years. You also know due to expenditures, you expect your cost of capital to be 10%. In this recipe, you will learn to use Excel functions to calculate the future value of the venture and whether this proves to be profitable. How to do it... We will first need to enter all known values and variables into the worksheet: In cell B2, enter the initial cost of the business venture: In cell B3, enter the discount rate, or the cost of capital of 10%: In cells B4 through B8, enter the five years' worth of expected net profit from the business venture: In cell B10, we will calculate the net present value: Enter the formula =NPV(B3,B4:B8) and press Enter We now see that accounting for future inflows, the net present value of the business venture is $20,849.33. Our last step is to account for the initial start-up costs and determine the overall profitability. In cell B11, enter the formula =B10/B2 and press Enter: As a financial manager, we now see that for every $1 invested in this venture, you will receive $1.04 in present value inflows. How it works... NPV or net present value is calculated in Excel using all of the inflow information that was entered across the estimated period. For the five years used in this recipe, the venture shows a profit of $5500 for each year. This number cannot be used directly, because there is a cost of making money. The cost in this instance pertains to taxes and other expenditures. In the NPV formula, we did not include the initial cost of start-up because this cost is exempt from the cost of capital; however, it must be used at the end of the formula to account for the outflow compared to the inflows. There's more... The $1.04 value calculated at the end of this recipe is also known as the profitability index. When this index is greater than one, the venture is said to be a positive investment.

0
0
5172

article-image-getting-started-oracle-goldengate

Packt

12 Jul 2011

8 min read

Getting Started with Oracle GoldenGate

Packt

12 Jul 2011

8 min read

What is GoldenGate? Oracle GoldenGate is Oracle's strategic solution for real-time data integration. GoldenGate software enables mission-critical systems to have continuous availability and access to real-time data. It offers a fast and robust solution for replicating transactional data between operational and analytical systems. Oracle GoldenGate captures, filters, routes, verifies, transforms, and delivers transactional data in real-time, across Oracle and heterogeneous environments with very low impact and preserved transaction integrity. The transaction data management provides read consistency, maintaining referential integrity between source and target systems. As a competitor to Oracle GoldenGate, data replication products, and solutions exist from other software companies and vendors. These are mainly storage replication solutions that provide a fast point in time data restoration. The following is a list of the most common solutions available today: EMC SRDF and EMC RecoverPoint IBM PPRC and Global Mirror (known together as IBM Copy Services) Hitachi TrueCopy Hewlett-Packard Continuous Access (HP CA) Symantec Veritas Volume Replicator (VVR) DataCore SANsymphony and SANmelody FalconStor Replication and Mirroring Compellent Remote Instant Replay Data replication techniques have improved enormously over the past 10 years and have always been a requirement in nearly every IT project in every industry. Whether for Disaster Recovery (DR), High Availability (HA), Business Intelligence (BI), or even regulatory reasons, the requirements and expected performance have also increased, making the implementation of efficient and scalable data replication solutions a welcome challenge. Oracle GoldenGate evolution GoldenGate Software Inc was founded in 1995. Originating in San Francisco, the company was named after the famous Golden Gate Bridge by its founders, Eric Fish and Todd Davidson. The tried and tested product that emerged quickly became very popular within the financial industry. Originally designed for the fault tolerant Tandem computers, the resilient and fast data replication solution was in demand. The banks initially used GoldenGate software in their ATM networks for sending transactional data from high street machines to mainframe central computers. The data integrity and guaranteed zero data loss is obviously paramount and plays a key factor. The key architectural properties of the product are as follows: Data is sent in "real time" with sub-second speed. Supports heterogeneous environments across different database and hardware types. "Transaction aware" —maintaining its read-consistent and referential integrity between source and target systems. High performance with low impact; able to move large volumes of data very efficiently while maintaining very low lag times and latency. Flexible modular architecture. Reliable and extremely resilient to failure and data loss. No single point of failure or dependencies, and easy to recover. Oracle Corporation acquired GoldenGate Software in September 2009. Today there are more than 500 customers around the world using GoldenGate technology for over 4000 solutions, realizing over $100 million in revenue for Oracle. Oracle GoldenGate solutions Oracle GoldenGate provides five data replication solutions: High Availability Live Standby for an immediate fail-over solution that can later re-synchronize with your primary source. Active-Active solutions for continuous availability and transaction load distribution between two or more active systems. Zero-Downtime Upgrades and Migrations Eliminates downtime for upgrades and migrations. Live Reporting Feeding a reporting database so as not to burden the source production systems with BI users or tools. Operational Business Intelligence (BI) Real-time data feeds to operational data stores or data warehouses, directly or via Extract Transform and Load (ETL) tools. Transactional Data Integration Real-time data feeds to messaging systems for business activity monitoring, business process monitoring, and complex event processing. Uses event-driven architecture and service-oriented architecture (SOA). The following diagram shows the basic architecture for the various solutions available from GoldenGate software: We have discovered there are many solutions where GoldenGate can be applied. Now we can dive into how GoldenGate works, the individual processes, and the data flow that is adopted for all. Oracle GoldenGate technology overview Let's take a look at GoldenGate's fundamental building blocks; the Capture process, Trail files, Data pump, Server collector, and Apply processes. In fact, the order in which the processes are listed depicts the sequence of events for GoldenGate data replication across distributed systems. A Manager process runs on both the source and the target systems that "oversee" the processing and transmission of data. All the individual processes are modular and can be easily decoupled or combined to provide the best solution to meet the business requirements. It is normal practice to configure multiple Capture and Apply processes to balance the load and enhance performance. Filtering and transformation of the data can be done at either the source by the Capture or at the target by the Apply processes. This is achieved through parameter files. The capture process (Extract) Oracle GoldenGate's capture process, known as Extract, obtains the necessary data from the databases' transaction logs. For Oracle, these are the online redo logs that contain all the data changes made in the database. GoldenGate does not require access to the source database and only extracts the committed transactions from the online redo logs. It can, however, read archived redo logs to extract the data from long-running transactions. The Extract process will regularly checkpoint its read and write position, typically to a file. The checkpoint data insures GoldenGate can recover its processes without data loss in the case of failure. The Extract process can have one of the following statuses: STOPPED STARTING RUNNING ABENDED The ABENDED status stems back to the Tandem computer, where processes either stop (end normally) or abend (end abnormally). Abend is short for an abnormal end. Trail files To replicate transactional data efficiently from one database to another, Oracle GoldenGate converts the captured data into a Canonical Format which is written to trail files, both on the source and the target system. The provision of the source and target trail files in the GoldenGates architecture eliminates any single point of failure and ensures data integrity is maintained. A dedicated checkpoint process keeps track of the data being written to the trails on both the source and target for fault tolerance. It is possible to configure GoldenGate not to use trail files on the source system and write data directly from the database's redo logs to the target server data collector. In this case, the Extract process sends data in large blocks across a TCP/IP network to the target system. However, this configuration is not recommended due to the possibility of data loss occurring during unplanned system or network outages. Best practice states, the use of local trail files would provide a history of transactions and support the recovery of data for retransmission via a Data Pump. Data pump When using trail files on the source system, known as a local trail, GoldenGate requires an additional Extract process called Data pump that sends data in large blocks across a TCP/IP network to the target system. As previously stated, this is best practice and should be adopted for all Extract configurations. Server collector The server collector process runs on the target system and accepts data from the source (Extract/Data Pump). Its job is to reassemble the data and write it to a GoldenGate trail file, known as a remote trail. The Apply process (Replicat) The Apply process, known in GoldenGate as Replicat, is the final step in the data delivery. It reads the trail file and applies it to the target database in the form of DML (deletes, updates, and inserts) or DDL*. (database structural changes). This can be concurrent with the data capture or performed later. The Replicat process will regularly checkpoint its read and write position, typically to a file. The checkpoint data ensures that GoldenGate can recover its processes without data loss in the case of failure. The Replicat process can have one of the following statuses: STOPPED STARTING RUNNING ABENDED * DDL is only supported in unidirectional configurations and non-heterogeneous (Oracle to Oracle) environments. The Manager process The Manager process runs on both source and target systems. Its job is to control activities such as starting, monitoring, and restarting processes; allocating data storage; and reporting errors and events. The Manager process must exist in any GoldenGate implementation. However, there can be only one Manager process per Changed Data Capture configuration on the source and target. The Manager process can have either of the following statuses: STOPPED RUNNING GGSCI In addition to the processes previously described, Oracle GoldenGate 10.4 ships with its own command line interface known as GoldenGate Software Command Interface (GGSCI). This tool provides the administrator with a comprehensive set of commands to create, configure, and monitor all GoldenGate processes. Oracle GoldenGate 10.4 is command-line driven. However, there is a product called Oracle GoldenGate Director that provides a GUI for configuration and management of your GoldenGate environment. Process data flow The following diagram illustrates the GoldenGate processes and their dependencies. The arrows largely depict replicated data flow (committed transactions), apart from checkpoint data and configuration data. The Extract and Replicat processes periodically checkpoint to a file for persistence. The parameter file provides the configuration data. As described in the previous paragraphs, two options exist for sending data from source to target; these are shown as broken arrows: Having discovered all the processes required for GoldenGate to replicate data, let's now dive a little deeper into the architecture and configurations.

0
0
6535

article-image-building-financial-functions-excel-2010

Packt

07 Jul 2011

5 min read

Building Financial Functions into Excel 2010

Packt

07 Jul 2011

5 min read

Excel 2010 Financials Cookbook Powerful techniques for financial organization, analysis, and presentation in Microsoft Excel Till now, in the previous articles, we have focused on manipulating data within and outside of Excel in order to prepare to make financial decisions. Now that the data has been prepared, re-arranged, or otherwise adjusted, we are able to leverage the functions within Excel to make actual decisions. Utilizing these functions and the individual scenarios, we will be able to effectively eliminate the uncertainty due to poor analysis. Since this article utilizes financial scenarios for demonstrating the use of the various functions, it is important to note that these scenarios take certain "unknowns" for granted, and makes a number of assumptions in order to minimize the complexity of the calculation. Real-world scenarios will require a greater focus on calculating and accounting for all variables. Determining standard deviation for assessing risk In the recipes mentioned so far, we have shown the importance of monitoring and analyzing frequency to determine the likelihood that an event will occur. Standard deviation will now allow for an analysis of the frequency in a different manner, or more specifically, through variance. With standard deviation, we will be able to determine the basic top and bottom thresholds of data, and plot general movement within that threshold to determine the variance within the data range. This variance will allow the calculation of risk within investments. As a financial manager, you must determine the risk associated with investing capital in order to gain a return. In this particular instance, you will invest in stock. In order to minimize loss of investment capital, you must determine the risk associated between investing between two different stocks, Stock A, and Stock B. In this recipe, we will utilize standard deviation to determine which stock, either A or B, presents a higher risk, and hence a greater risk of loss. How to do it... We will begin by entering the selling prices of Stock A and Stock B in columns A and B, respectively: Within this list of selling prices, at first glance we can see that Stock B has a higher selling price. The stock opening price and selling price over the course of 52 weeks almost always remains above that of Stock A. As an investor looking to gain a higher return, we may wish to choose Stock B based on this cursory review; however, high selling price does not negate the need for consistency. In cell C2, enter the formula =STDEV(A2:A53) and press Enter: In cell C3, enter the formula =STDEV(B2:B53) and press Enter: We can see from the calculation of standard deviation, that Stock B has a deviation range or variance of over $20, whereas Stock A's variance is just over $9: Given this information, we can determine that Stock A presents a lower risk than Stock B. If we invest in Stock A, at any given time, utilizing past performance, our average risk of loss is $9, whereas in Stock B we an average risk of $20. How it works... The function of STDEV or standard deviation in Excel utilizes the given numbers as a complete population. This means that it does not account for any other changes or unknowns. Excel will use this data set as a complete set and determine the greatest change from high to low within the numbers. This range of change is your standard deviation. Excel also includes the function STDEVP that treats the data as a selection of a larger population. This function should be sed if you are calculating standard deviation on a subset of data (for example, six months out of an entire year). If we translate these numbers into a line graph with standard deviation bars, as shown in the following screenshot for Stock A, you can see the selling prices of the stock, and how they travel within the deviation range: If we translate these numbers into a line graph with standard deviation bars, as shown in the following screenshot for Stock B, you can see the selling prices of the stock, and understand how they travel within the deviation range: The bars shown on the graphs represent the standard deviation as calculated by Excel. We can see visually that not only does Stock B represent a greater risk with the larger deviation, but also many of the stock prices fall below our deviation, representing further risk to the investor. With funds to invest as a finance manager, Stock A represents a lower risk investment. There's more... Standard deviation can be calculated for almost any data set. For this recipe, we calculated deviation over the course of one year; however, if we expand the data to include multiple years we can further determine long-term risk. While Stock B represents high short-term risk, in the long-term analysis, Stock B may present as a less risky investment. Combining standard deviation with a five-number summary analysis, we can further gain risk and performance information.

0
0
4923

article-image-designing-user-security-oracle-peoplesoft-applications

Packt

04 Jul 2011

12 min read

Designing User Security for Oracle PeopleSoft Applications

Packt

04 Jul 2011

12 min read

Understanding User security Before we get into discussing the PeopleSoft security, let's spend some time trying to set the context for user security. Whenever we think of a complex system like PeopleSoft Financial applications with potentially hundreds of users, the following are but a few questions that face us: Should a user working in billing group have access to transactions, such as vouchers and payments, in Accounts Payable? Should a user who is a part of North America business unit have access to the data belonging to the Europe business unit? Should a user whose job involves entering vouchers be able to approve and pay them as well? Should a data entry clerk be able to view departmental budgets for the organization? These questions barely scratch the surface of the complex security considerations of an organization. Of course, there is no right or wrong answer for such questions, as every organization has its own unique security policies. What is more important is the fact that we need a mechanism that can segregate the access to system features. We need to enforce appropriate controls to ensure users can access only the features they need. Implementation challenge Global Vehicles' Accounts Payable department has three different types of users – Mary, who is the AP manager; Amy, who is the AP Supervisor; and Anton, who is the AP Clerk. These three types of users need to have the following access to PeopleSoft features: User typeAccess to featureDescriptionAP ClerkVoucher entryA clerk should be able to access system pages to enter vouchers from various vendors.AP SupervisorVoucher entry Voucher approval Voucher postingA supervisor also needs to have access to enter vouchers. He/she should review and approve each voucher entered by the clerk. Also, the supervisor should be able to execute the Voucher Post process that creates accounting entries for vouchers.AP ManagerPay cycle Voucher approvalAP manager should be the only one who can execute the Pay Cycle (a process that prints checks to issue payments to vendors). Manager (in addition to the Supervisor) should also have the authority to approve vouchers. Note that this is an extremely simplified scenario that does not really include all the required features in Accounts Payable. Solution We will design a security matrix that uses distinct security roles. We'll configure permission lists, roles, and user profiles to limit user access to required system features. PeopleSoft security matrix is a three-level structure consisting of Permission lists (at the bottom), Roles (in the middle) and User profiles (at the top). The following illustration shows how it is structured: We need to create a User Profile for each user of the system. This user profile can have as many Roles as needed. For example, a user can have roles of Billing Supervisor and Accounts Receivable Payment Processor, if he/she approves customer invoices as well as processes customer payments. Thus, the number of roles that a user should be granted depends entirely on his/her job responsibilities. Each role can have multiple permission lists. A Permission list determines which features can be accessed by a Role. We can specify which pages can be accessed, the mode in which they can be accessed (read only/add/update) in a permission list. In a nutshell, we can think of a Role as a grouping of system features that a user needs to access, while a Permission list defines the nature of access to those system features. Expert tip Deciding how many and which roles should be created needs quite a bit of thinking about. How easy will it be to maintain them in future? Think of the scenarios where a function needs to be removed from a role and added to another. How easy would it be to do so? As a rule of thumb, you should map system features to roles in such a way that they are non-overlapping. Similarly, map system pages to permission lists so that they are mutually exclusive. This simplifies user security maintenance to a large extent. Note that although it is advisable, it may not always be possible. Organizational requirements are sometimes too complicated to achieve this. However, a PeopleSoft professional should try to build a modular security design to the extent possible. Now, let's try to design our security matrix for the hypothetical scenario presented previously and test our thumb rule of mutually exclusive roles and permission lists. What do you observe as far as required system features for a role are concerned? You are absolutely right if you thought that some of system features (such as Voucher entry) are common across roles. Which roles should we design for this situation? Going by the principle of mutually exclusive roles, we can map system features to the permission lists (and in turn roles) without overlapping them. We'll denote our roles by the prefix 'RL' and permission lists by the prefix 'PL'. Thus, the mapping may look something like this: RolePermission listSystem featureRL_Voucher_EntryPL_Voucher_EntryVoucher EntryRL_Voucher_ApprovalPL_Voucher_ApprovalVoucher ApprovalRL_Voucher_PostingPL_Voucher_PostingVoucher PostingRL_Pay_CyclePL_Pay_CyclePay Cycle So, now we have created the required roles and permission lists, and attached appropriate permission lists to each of the roles. In this example, due to the simple scenario, each role has only a single permission list assigned to it. Now as the final step, we'll assign appropriate roles to each user's User profile. UserRoleSystem feature accessedMaryRL_Voucher_ApprovalVoucher ApprovalRL_Pay_CyclePay CycleAmyRL_Voucher_EntryVoucher EntryRL_Voucher_ApprovalVoucher ApprovalRL_Voucher_PostingVoucher PostingAntonRL_Voucher_EntryVoucher Entry Now, as you can see, each user has access to the appropriate system feature through the roles and in turn, permission lists attached to their user profiles. Can you think of the advantage of our approach? Let's say that a few months down the line, it is decided that Mary (AP Manager) should be the only one approving vouchers, while Amy (AP Supervisor) should also have the ability to execute pay cycles and issue payments to vendors. How can we accommodate this change? It's quite simple really – we'll remove the role RL_Voucher_Approval from Amy's user profile and add the role RL_Pay_Cycle to her profile. So now the security matrix will look like this: UserRoleSystem feature accessedMaryRL_Voucher_ApprovalVoucher ApprovalRL_Pay_CyclePay CycleAmyRL_Voucher_EntryVoucher EntryRL_Pay_CyclePay CycleRL_Voucher_PostingVoucher PostingAntonRL_Voucher_EntryVoucher Entry Thus, security maintenance becomes less cumbersome when we design roles and permission lists with non-overlapping system features. Of course, this comes with a downside as well. This approach results in a large number of roles and permission lists, thereby increasing the initial configuration effort. The solution that we actually design for an organization needs to balance these two objectives. Expert tip Having too many permission lists assigned to a User Profile can adversely affect the system performance. PeopleSoft recommends 10-20 permission lists per user. Configuring permission lists Follow this navigation to configure permission lists: PeopleTools | Security | Permissions & Roles | Permission Lists The following screenshot shows the General tab of the Permission List set up page. We can specify the permission list description on this tab. We'll go through some of the important configuration activities for the Voucher Entry permission list discussed previously. Can Start Application Server?: Selecting this field enables a user with this permission list to start PeopleSoft application servers (all batch processes are executed on this server). A typical system user does not need this option. Allow Password to be Emailed?: Selecting this field enables users to receive forgotten passwords through e-mail. Leave the field unchecked to prevent unencrypted passwords from being sent in e-mails. Time-out Minutes – Never Time-out and Specific Time-out: These fields determine the number of minutes of inactivity after which the system automatically logs out the user with this permission list. The following screenshot shows the Pages tab of the permission list page. This is the most important place where we specify the menus, components, and ultimately the pages that a user can access. In PeopleSoft parlance, a component means collection of pages that are related to each other. In the previous screenshot, you are looking at a page. You can also see other related pages used to configure permission lists – General, PeopleTools, Process, and so on. All these pages constitute a component. A menu is a collection of various components. It is typically related to a system feature, such as 'Enter Voucher Information' as you can see in the screenshot. Thus, to grant a page access, we need to enter its component and menu details. Expert tip In order to find out the component and menu for a page, press CTRL+J when it is open in the internet browser. Menu Name: This is where we need to specify all the menus to which a user needs to have access. Note that a permission list can grant access to multiple menus, but our simple example includes only one system feature (Voucher entry) and in turn, only one menu. Click the + button to add and select more menus. Edit Component hyperlink: Once we select a menu, we need to specify which components under it should be accessed by the permission list. Click the Edit Components hyperlink to proceed. The system opens the Component Permissions page, as shown in the following screenshot: In this page, the system shows all components under the selected menu. The previous screenshot shows only a part of the component list under the 'Enter Voucher Information' menu. Voucher entry pages for which we need to grant access exist under a component named VCHR_EXPRESS. Click the Edit Pages hyperlink to grant access to specific pages under a given component. The system opens the Page Permissions page, as shown in the following screenshot: The given screenshot shows the partial list of pages under the Voucher Entry component. Panel Item Name: This field shows the page name to which access is to be granted. Authorized?: In order to enable a user to access a page, select this checkbox. As you can see, we have authorized this permission list to access six pages in this component. Display Only: Clicking this checkbox allows the user to get read-only access for the given page. He/she cannot make any changes to the data on this page. Add: Selecting this checkbox enables the user to add a new transaction (in this case new vouchers). Update/Display: Selecting this checkbox enables the user to retrieve the current effective dated row. He/she can also add future effective dated rows in addition to modifying them. Update/Display All: This option gives the user all of the privileges of the Update/Display option. In addition he/she can retrieve past effective dated rows as well. Correction: This option enables the user to perform all the possible operations; that is, to retrieve, modify or add past, current, and future effective dated rows. Effective dates are usually relevant for master data set ups. It drives when a particular value comes into effect. A vendor may be set up with effective date of 1/1/2001, so thatit comes into effect from that date. Now assume that its name is slated to change on 1/1/2012. However, we can add a new vendor row with the new name and the appropriate effective date. The system automatically starts using the new name from 1/1/2012. Note that there can be multiple future effective dated rows, but only one current row. The next tab PeopleTools contains configuration options that are more relevant for technical developers. As we are concentrating on business users of the system, we'll not discuss them. Click the Process tab. As shown in the following screenshot, this tab is used to configure options for Process groups. A process group is a collection of batch processes belonging to a specific internal department or a business process. For example, PeopleSoft delivers a process group ARALL that includes all Accounts Receivable batch processes. PeopleSoft delivers various pre-configured process groups; however, we can create our own process groups depending on the organization's requirements. Click the Process Group Permissions hyperlink. The system opens a new page where we can select as many process groups as needed. When a process group is selected for a permission list, it enables the users to execute batch and online processes that are part of it. The following screenshot shows the Sign-on Times tab. This tab controls the time spans when a user with this permission list can sign-on to the system. We can enforce specific days or specific time spans for a particular day when users can sign-on. In the case of our permission list, there are no such limits and users with this permission list will be able to sign-on anytime on all days of the week. The next tab on this page is the Component Interface. The component interface is a PeopleSoft utility that automates bulk data entry into PeopleSoft pages. We can select component interface values on this tab, so that users with this permission list have access rights to use them. Due to the highly technical nature of the activities involved, we will not discuss the Web Libraries, Web Services, Mass change, Links, and Audit tabs. Oracle offers exhaustive documentation on PeopleSoft applications at the following here. The next important tab on the permission list page is Query. On this tab, the system shows two hyperlinks: Access Group Permissions and Query Profile. Click the Access Group Permissions hyperlink. The system opens the Permission List Access Groups page, as shown in the next screenshot. This page controls which database tables can be accessed by users to create database queries, using a PeopleSoft tool called Query Manager. Tree Name: A tree is a hierarchical structure of database tables. As shown in the screenshot, the tree QUERY_TREE_AP groups all AP tables. Access Group: Each tree has multiple nodes called access groups. These access groups are just logical groups of tables within a tree. In the screenshot, VOUCHERS is a group of voucher-related tables within the AP table tree. With this configuration, users will be able to create database queries on voucher tables in AP. Using the Query Profile hyperlink, we can set various options that control how users can create queries (such as if he/she can use joins, unions, and so on in queries, how many database rows can be fetched by the query, if the user can copy the query output to Microsoft Excel, and so on).

0
0
7038

article-image-overview-oracle-peoplesoft-commitment-control

Packt

29 Jun 2011

9 min read

An Overview of Oracle PeopleSoft Commitment Control

Packt

29 Jun 2011

9 min read

0
0
8730

article-image-excel-2010-financials-using-graphs-analysis

Packt

28 Jun 2011

6 min read

Excel 2010 Financials: Using Graphs for Analysis

Packt

28 Jun 2011

6 min read

Introduction Graphing is one of the most effective ways to display datasets, financial scenarios, and statistical functions in a way that can be understood easily by the users. When you give an individual a list of 40 different numbers and ask him or her to draw a conclusion, it is not only difficult, it may be impossible without the use of extra functions. However, if you provide the same individual a graph of the numbers, they will most likely be able to notice trending, dataset size, frequency, and so on. Despite the effectiveness of graphing and visual modeling, financial and statistical graphing is often overlooked in Excel 2010 due to difficulty, or lack of native functions. In this article, you will learn to not only add reusable methods to automate graph production, but also how to create graphs and graphing sets that are not native to Excel. You will learn to use box and whisker plots, histograms to demonstrate frequency, stem and leaf plots, and other methods to graph financial ratios and scenarios. Charting financial frequency trending with a histogram Frequency calculations are used throughout financial analysis, statistics, and other mathematical representations to determine how often an event has occurred. Determining the frequency of financial events in a transaction list can assist in determining the popularity of an item or product, the future likelihood of an event to reoccur, or frequency of profitability of an organization. Excel, however, does not create histograms by default. In this recipe, you will learn how to use several functions including bar charts and FREQUENCY functions to create a histogram frequency chart within Excel to determine profitability of an entity. Getting ready When plotting histogram frequency, we are using frequency and charting to determine the continued likelihood of an event from past data visually. Past data can be flexible in terms of what we are trying to determine; in this instance, we will use the daily net profit (Sale income Versus Gross expenses) for a retail store. The daily net profit numbers for one month are as follows: $150, $237, -$94.75, $1,231, $876, $455, $349, -$173, -$34, -$234, $110, $83, -$97, -$129, $34, $456, $1010, $878, $211, -$34, -$142, -$87, $312 How to do it... Utilizing the profit numbers from above, we will begin by adding the data to the Excel worksheet: Within Excel, enter the daily net profit numbers into column A starting on row 2 until all the data has been entered: We must now create the boundaries to be used within the histogram. The boundary numbers will be the highest and the lowest number thresholds that will be included within the graph. The boundaries to be used in this instance will be of $1500, and -$1500. These boundaries will encompass our entire dataset, and it will allow padding on the higher and lower ends of the data to allow for variation when plotting larger datasets encompassing multiple months or years worth of profit. We must now create bins that we will chart against the frequency. The bins will be the individual data-points that we want to determine the frequency against. For instance, one bin will be $1500, and we will want to know how often the net profit of the retail location falls within the range of $1500. The smaller the bins chosen, the larger the chart area. You will want to choose a bin size that will accurately reflect the frequency of your dataset without creating a large blank space. Bin size will change depending on the range of data to be graphed. Enter the chosen bin number into the worksheet in Column C. The bins will be a $150 difference from the previous bin.The bin sizes needed to include an appropriate range in order to illustrate the expanse of the dataset completely. In order to encompass all appropriate bin sizes, it is necessary to begin with the largest negative number, and increment to the largest positive number: The last component for creating the frequency histogram actually determines the frequency of net profit to the designated bins. For this, we will use the Excel function FREQUENCY. Select rows D2 through D22, and enter the following formula: =FREQUENCY(A:A,C2:C22) After entering the formula, press Shift + Ctrl + Enter to finalize the formula as an array formula.Excel has now displayed the frequency of the net profit for each of the designated bins: The information for the histogram is now ready. We will now be able to create the actual histogram graph. Select rows C2 through D22. With the rows selected, using the Excel ribbon, choose Insert | Column: From the Column drop-down, choose the Clustered Column chart option: Excel will now attempt to determine the plot area, and will present you with a bar chart. The chart that Excel creates does not accurately display the plot areas, due to Excel being unable to determine the correct data range: Select the chart. From the Ribbon, choose Select Data: From the select Data Source window that has appeared, you will change the Horizontal Axis to include the Bins column (column C), and the Legend Series to include the Frequency (column D). Excel will also create two series entries within the Legend Series panel; remove Series2 and select OK: Excel now presents the histogram graph of net profit frequency: Using the Format graph options on the Excel Ribbon, reduce the bar spacing and adjust the horizontal label to reduce the chart area to your specifications: Using the histogram for financial analysis, we can now determine that the major trend of the retail location quickly and easily, which maintains a net profit within $0 - $150, while maintaining a positive net profit throughout the month. How it works... While a histogram graph/chart is not native to Excel, we were able to use a bar/column chart to plot the frequency of net profit within specific bin ranges. The FREQUENCY function of Excel follows the following format: =FREQUENCY(Data Range to plot, Bins to plot within) It is important to note that within the data range, we chose the range A:A. This range includes all of the data within the A column. If arbitrary or unnecessary data unrelated to the histogram were added into column A, this range would include them. Do not allow unnecessary data to be added to column A, or use a limited range such as A1:A5 if your data was only included in the first five cells of column A. We entered the formula by pressing Shift + Ctrl + Enter in order to submit the formula as an array formula. This allows Excel to calculate individual information within a large array of data. The graph modifications allowed the bins to show frequency in the vertical axis, or indicate how many times a specific number range was achieved. The horizontal axis displayed the actual bins. There's more... The amount of data for this histogram was limited; however, its usefulness was already evident. When this same charting recipe is used to chart a large dataset (for example, multiple year data), the histogram becomes even more useful in displaying trends.

0
0
12439

article-image-excel-2010-financials-adding-animations-excel-graphs

Packt

28 Jun 2011

3 min read

Excel 2010 Financials: Adding Animations to Excel Graphs

Packt

28 Jun 2011

3 min read

Excel 2010 Financials Cookbook Powerful techniques for financial organization, analysis, and presentation in Microsoft Excel Getting ready We will start this recipe with a single column graph demonstrating profitability. The dataset for this graph is cell A1. Cell A1 has the formula =A2/100. Cell A2 contains the number 1000: How to do it... Press Alt + F11 on the keyboard to open the Excel Visual Basic Editor (VBE). Once in the VBE, choose Insert | Module from the file menu. Enter the following code: Sub Animate() Dim x As Integer x = 0 Range("A2").Value = x While x < 1000 x = x + 1 Range("A2").Value = Range("A2").Value + 1 For y = 1 To 500000 Next DoEvents Wend End Sub The code should be formatted within the VBE code window as follows: Save your work and close the VBE. Once back at the worksheet, from the Ribbon, choose View | Macros | View Macros. Select the Animate macro and choose Run. The graph for profitability will now drop to 0 and slowly rise to account for the full profitability. How it works... The graph was set to use the data within cell A1 as the dataset to graph. While there was a value within the cell, the graph shows the column and corresponding value. If you were to manually change the value of cell A1 and press enter, the graph would also change to the new value. It is this update event that we harness within the VBA macro that was added. Sub Animate() Dim x As Integer x = 0 Here we first declare and set any variables that we will need. The value of x will hold the overall profit: Range("A2").Value = x While x < 1000 x = x + 1 Range("A2").Value = Range("A2").Value + 1 For y = 1 To 500000 Next DoEvents Wend Next, we choose cell A2 from the worksheet and set its value to x (which begins as 0). Since cell A1 is set to be A2/100, A1 (which is the graph dataset) is now zero. Using a "while" clause in Excel, we take x and add 1 to its value, then pause a moment using DoEvents, to allow Excel to update the graph, then we repeat adding another value of 1. This is repeated until x is equal to 1000 which when divided by 100 as in the formula for A1, becomes 10. Now, the profitability we end with in our animation graph is 10. End Sub There's more... We can change the height of the profitability graph by simply changing the 1000 in the While x < 1000 line. By setting this number to 900, the graph will only grow to the profit level of nine. Further resources on this subject: Excel 2010 Financials: Using Graphs for Analysis [Article] Load Testing Using Visual Studio 2008 [Article] How to Manage Content in a List in Microsoft Sharepoint [Article] Sage ACT! 2011: Creating a Quick Report [Article] Tips and Tricks: Report Page in IBM Cognos 8 Report Studio [Article]

0
0
3402

article-image-pentaho-data-integration-4-working-complex-data-flows

Packt

27 Jun 2011

7 min read

Pentaho Data Integration 4: working with complex data flows

Packt

27 Jun 2011

7 min read

Joining two or more streams based on given conditions There are occasions where you will need to join two datasets. If you are working with databases, you could use SQL statements to perform this task, but for other kinds of input (XML, text, Excel), you will need another solution. Kettle provides the Merge Join step to join data coming from any kind of source. Let's assume that you are building a house and want to track and manage the costs of building it. Before starting, you prepared an Excel file with the estimated costs for the different parts of your house. Now, you are given a weekly file with the progress and the real costs. So, you want to compare both to see the progress. Getting ready To run this recipe, you will need two Excel files, one for the budget and another with the real costs. The budget.xls has the estimated starting date, estimated end date, and cost for the planned tasks. The costs.xls has the real starting date, end date, and cost for tasks that have already started. You can download the sample files from here. How to do it... Carry out the following steps: Create a new transformation. Drop two Excel input steps into the canvas. Use one step for reading the budget information (budget.xls file) and the other for reading the costs information (costs.xls file). Under the Fields tab of these steps, click on the Get fields from header row... button in order to populate the grid automatically. Apply the format dd/MM/yyyy to the fields of type Date and $0.00 to the fields with costs. Add a Merge Join step from the Join category, and create a hop from each Excel input step toward this step. The following diagram depicts what you have so far: Configure the Merge Join step, as shown in the following screenshot: If you do a preview on this step, you will obtain the result of the two Excel files merged. In order to have the columns more organized, add a Select values step from the Transform category. In this new step, select the fields in this order: task, starting date (est.), starting date, end date (est.), end date, cost (est.), cost. Doing a preview on the last step, you will obtain the merged data with the columns of both Excel files interspersed, as shown in the following screenshot: How it works... In the example, you saw how to use the Merge Join step to join data coming from two Excel files. You can use this step to join any other kind of input. In the Merge Join step, you set the name of the incoming steps, and the fields to use as the keys for joining them. In the recipe, you joined the streams by just a single field: the task field. The rows are expected to be sorted in an ascending manner on the specified key fields. There's more... In the example, you set the Join Type to LEFT OUTER JOIN. Let's see explanations of the possible join options: Interspersing new rows between existent rows In most Kettle datasets, all rows share a common meaning; they represent the same kind of entity, for example: In a dataset with sold items, each row has data about one item In a dataset with the mean temperature for a range of days in five different regions, each row has the mean temperature for a different day in one of those regions In a dataset with a list of people ordered by age range (0-10, 11-20, 20-40, and so on), each row has data about one person Sometimes, there is a need of interspersing new rows between your current rows. Taking the previous examples, imagine the following situations: In the sold items dataset, every 10 items, you have to insert a row with the running quantity of items and running sold price from the first line until that line. In the temperature's dataset, you have to order the data by region and the last row for each region has to have the average temperature for that region. In the people's dataset, for each age range, you have to insert a header row just before the rows of people in that range. In general, the rows you need to intersperse can have fixed data, subtotals of the numbers in previous rows, header to the rows coming next, and so on. What they have in common is that they have a different structure or meaning compared to the rows in your dataset. Interspersing these rows is not a complicated task, but is a tricky one. In this recipe, you will learn how to do it. Suppose that you have to create a list of products by category. For each category, you have to insert a header row with the category description and the number of products inside that category. The final result should be as follows: Getting ready This recipe uses an outdoor database with the structure shown in Appendix, Data Structures (Download here). As source, you can use a database like this or any other source, for example a text file with the same structure. How to do it... Carry out the following steps: Create a transformation, drag into the canvas a Table Input step, select the connection to the outdoor database, or create it if it doesn't exist. Then enter the following statement: SELECT category , desc_product FROM products p ,categories c WHERE p.id_category = c.id_category ORDER by category Do a preview of this step. You already have the product list! Now, you have to create and intersperse the header rows. In order to create the headers, do the following: From the Statistics category, add a Group by step and fill in the grids, as shown in the following screenshot: From the Scripting category, add a User Defined Java Expression step, and use it to add two fields: The first will be a String named desc_product, with value ("Category: " + category).toUpperCase(). The second will be an Integer field named order with value 1. Use a Select values step to reorder the fields as category, desc_product, qty_product, and order. Do a preview on this step; you should see the following result: Those are the headers. The next step is mixing all the rows in the proper order. Drag an Add constants step into the canvas and a Sort rows step. Link them to the other steps as shown: Use the Add constants to add two Integer fields: qty_prod and order. As Value, leave the first field empty, and type 2 for the second field. Use the Sort rows step for sorting by category, order, and desc_product. Select the last step and do a preview. You should see the rows exactly as shown in the introduction. How it works... When you have to intersperse rows between existing rows, there are just four main tasks to do, as follows: Create a secondary stream that will be used for creating new rows. In this case, the rows with the headers of the categories. In each stream, add a field that will help you intersperse rows in the proper order. In this case, the key field was named order. Before joining the two streams, add, remove, and reorder the fields in each stream to make sure that the output fields in each stream have the same metadata. Join the streams and sort by the fields that you consider appropriate, including the field created earlier. In this case, you sorted by category, inside each category by the field named order and finally by the products description. Note that in this case, you created a single secondary stream. You could create more if needed, for example, if you need a header and footer for each category.

0
0
8270

article-image-pentaho-data-integration-4-understanding-data-flows

Packt

24 Jun 2011

12 min read

Pentaho Data Integration 4: Understanding Data Flows

Packt

24 Jun 2011

12 min read

Pentaho Data Integration 4 Cookbook Over 70 recipes to solve ETL problems using Pentaho Kettle Read more about this book This article by Adrián Sergio Pulvirenti and María Carina Roldán, authors of Pentaho Data Integration 4 Cookbook, focuses on the different ways for combining, splitting, or manipulating streams or flows of data using Kettle transformations. The main purpose of Kettle transformations is to manipulate data in the form of a dataset; this task is done by the steps of the transformation. In this article, we will cover: Splitting a stream into two or more streams based on a condition Merging rows from two streams with the same or different structure Comparing two streams and generating differences Generating all possible pairs formed from two datasets (For more resources on this subject, see here.) Introduction The main purpose of Kettle transformations is to manipulate data in the form of a dataset; this task is done by the steps of the transformation. When a transformation is launched, all its steps are started. During the execution, the steps work simultaneously reading rows from the incoming hops, processing them, and delivering them to the outgoing hops. When there are no more rows left, the execution of the transformation ends. The dataset that flows from step to step is not more than a set of rows all having the same structure or metadata. This means that all rows have the same number of columns, and the columns in all rows have the same type and name. Suppose that you have a single stream of data and that you apply the same transformations to all rows, that is, you have all steps connected in a row one after the other. In other words, you have the simplest of the transformations from the point of view of its structure. In this case, you don't have to worry much about the structure of your data stream, nor the origin or destination of the rows. The interesting part comes when you face other situations, for example: You want a step to start processing rows only after another given step has processed all rows You have more than one stream and you have to combine them into a single stream You have to inject rows in the middle of your stream and those rows don't have the same structure as the rows in your dataset With Kettle, you can actually do this, but you have to be careful because it's easy to end up doing wrong things and getting unexpected results or even worse: undesirable errors. With regard to the first example, it doesn't represent a default behavior due to the parallel nature of the transformations as explained earlier. There are two steps however, that might help, which are as follows: Blocking Step: This step blocks processing until all incoming rows have been processed. Block this step until steps finish: This step blocks processing until the selected steps finish. Both these steps are in the Flow category. This and the next article on Working with Complex Data Flows focuses on the other two examples and some similar use cases, by explaining the different ways for combining, splitting, or manipulating streams of data. Splitting a stream into two or more streams based on a condition In this recipe, you will learn to use the Filter rows step in order to split a single stream into different smaller streams. In the There's more section, you will also see alternative and more efficient ways for doing the same thing in different scenarios. Let's assume that you have a set of outdoor products in a text file, and you want to differentiate the tents from other kind of products, and also create a subclassification of the tents depending on their prices. Let's see a sample of this data: id_product,desc_product,price,category1,"Swedish Firesteel - Army Model",19,"kitchen"2,"Mountain House #10 Can Freeze-Dried Food",53,"kitchen"3,"Lodge Logic L9OG3 Pre-Seasoned 10-1/2-Inch RoundGriddle",14,"kitchen"... Getting ready To run this recipe, you will need a text file named outdoorProducts.txt with information about outdoor products. The file contains information about the category and price of each product. How to do it... Carry out the following steps: Create a transformation. Drag into the canvas a Text file input step and fill in the File tab to read the file named outdoorProducts.txt. If you are using the sample text file, type , as the Separator. Under the Fields tab, use the Get Fields button to populate the grid. Adjust the entries so that the grid looks like the one shown in the following screenshot: Now, let's add the steps to manage the flow of the rows. To do this, drag two Filter rows steps from the Flow category. Also, drag three Dummy steps that will represent the three resulting streams. Create the hops, as shown in the following screenshot. When you create the hops, make sure that you choose the options according to the image: Result is TRUE for creating a hop with a green icon, and Result is FALSE for creating a hop with a red icon in it. Double-click on the first Filter rows step and complete the condition, as shown in the following screenshot: Double-click on the second Filter rows step and complete the condition with price < 100. You have just split the original dataset into three groups. You can verify it by previewing each Dummy step. The first one has products whose category is not tents; the second one, the tents under 100 US$; and the last group, the expensive tents; those whose price is over 100 US$. The preview of the last Dummy step will show the following: How it works... The main objective in the recipe is to split a dataset with products depending on their category and price. To do this, you used the Filter rows step. In the Filter rows setting window, you tell Kettle where the data flows to depending on the result of evaluating a condition for each row. In order to do that, you have two list boxes: Send 'true' data to step and Send 'false' data to step. The destination steps can be set by using the hop properties as you did in the recipe. Alternatively, you can set them in the Filter rows setting dialog by selecting the name of the destination steps from the available drop-down lists. You also have to enter the condition. The condition has the following different parts: The upper textbox on the left is meant to negate the condition. The left textbox is meant to select the field that will be used for comparison. Then, you have a list of possible comparators to choose from. On the right, you have two textboxes: The upper textbox for comparing against a field and the bottom textbox for comparing against a constant value. Also, you can include more conditions by clicking on the Add Condition button on the right. If you right-click on a condition, a contextual menu appears to let you delete, edit, or move it. In the first Filter rows step of the recipe, you typed a simple condition: You compared a field (category) with a fixed value (tents) by using the equal (=) operator. You did this to separate the tents products from the others. The second filter had the purpose of differentiating the expensive and the cheap tents. There's more... You will find more filter features in the following subsections. Avoiding the use of Dummy steps In the recipe, we assumed that you wanted all three groups of products for further processing. Now, suppose that you only want the cheapest tents and you don't care about the rest. You could use just one Filter rows step with the condition category = tents AND price < 100, and send the 'false' data to a Dummy step, as in shown in the following diagram: The rows that don't meet the condition will end at the Dummy step. Although this is a very commonly used solution for keeping just the rows that meet the conditions, there is a simpler way to implement it. When you create the hop from the Filter rows toward the next step, you are asked for the kind of hop. If you choose Main output of step, the two options Send 'true' data to step and Send 'false' data to step will remain empty. This will cause two things: Only the rows that meet the condition will pass. The rest will be discarded. Comparing against the value of a Kettle variable The recipe above shows you how to configure the condition in the Filter rows step to compare a field against another field or a constant value, but what if you want to compare against the value of a Kettle variable? Let's assume, for example, you have a named parameter called categ with kitchen as Default Value. As you might know, named parameters are a particular kind of Kettle variable. You create the named parameters under the Parameter tab from the Settings option of the Edit menu. To use this variable in a condition, you must add it to your dataset in advance. You do this as follows: Add a Get Variables step from the Job category. Put it in the stream after the Text file input step and before the Filter Rows step; use it to create a new field named categ of String type with the value ${categ} in the Variable column. Now, the transformation looks like the one shown in the following screenshot: After this, you can set the condition of the first Filter rows step to category = categ, selecting categ from the listbox of fields to the right. This way, you will be filtering the kitchen products. If you run the transformation and set the parameter to tents, you will obtain similar results to those that were obtained in the main recipe. Avoiding the use of nested Filter Rows steps Suppose that you want to compare a single field against a discrete and short list of possible values and do different things for each value in that list. In this case, you can use the Switch / Case step instead of nested Filter rows steps. Let's assume that you have to send the rows to different steps depending on the category. The best way to do this is with the Switch / Case step. This way you avoid adding one Filter row step for each category. In this step, you have to select the field to be used for comparing. You do it in the Field name to switch listbox. In the Case values grid, you set the Value—Target step pairs. The following screenshot shows how to fill in the grid for our particular problem: The following are some considerations about this step: You can have multiple values directed to the same target step You can leave the value column blank to specify a target step for empty values You have a listbox named Default target step to specify the target step for rows that do not match any of the case values You can only compare with an equal operator If you want to compare against a substring of the field, you could enable the Use string contains option and as Case Value, type the substring you are interested in. For example, if for Case Value, you type tent_ then all categories containing tent_ such as tent_large, tent_small, or best_tents will be redirected to the same target step. Overcoming the difficulties of complex conditions There will be situations where the condition is too complex to be expressed in a single Filter rows step. You can nest them and create temporary fields in order to solve the problem, but it would be more efficient if you used the Java Filter or User Defined Java Expression step as explained next. You can find the Java Filter step in the Flow category. The difference compared to the Filter Rows step is that in this step, you write the condition using a Java expression. The names of the listboxes—Destination step for matching rows (optional) and Destination step for non-matching rows (optional)—differ from the names in the Filter rows step, but their purpose is the same. As an example, the following are the conditions you used in the recipe rewritten as Java expressions: category.equals("tents") and price < 100. These are extremely simple, but you can write any Java expression as long as it evaluates to a Boolean result. If you can't guarantee that the category will not be null, you'd better invert the first expression and put "tents".equals(category) instead. By doing this, whenever you have to check if a field is equal to a constant, you avoid an unexpected Java error. Finally, suppose that you have to split the streams simply to set some fields and then join the streams again. For example, assume that you want to change the category as follows: Doing this with nested Filter rows steps leads to a transformation like the following: You can do the same thing in a simpler way: Replace all the steps but the Text file input with a User Defined Java Expression step located in the Scripting category. In the setting window of this step, add a row in order to replace the value of the category field: As New field and Replace value type category. As Value type select String. As Java expression, type the following: (category.equals("tents"))?(price<100?"cheap_tents":"expensive_tents"):category The preceding expression uses the Java ternary operator ?:. If you're not familiar with the syntax, think of it as shorthand for the if-then-else statement. For example, the inner expression price<100?"cheap_tents":"expensive_tents" means if (price<100) then return "cheap_tents" else return "expensive_tents". Do a preview on this step. You will see something similar to the following:

0
0
15640

Packt

21 Jun 2011

7 min read

Segmenting images in OpenCV

Packt

21 Jun 2011

7 min read

OpenCV 2 Computer Vision Application Programming Cookbook Over 50 recipes to master this library of programming functions for real-time computer vision Read more about this book OpenCV (Open Source Computer Vision) is an open source library containing more than 500 optimized algorithms for image and video analysis. Since its introduction in 1999, it has been largely adopted as the primary development tool by the community of researchers and developers in computer vision. OpenCV was originally developed at Intel by a team led by Gary Bradski as an initiative to advance research in vision and promote the development of rich, vision-based CPU-intensive applications. In the previous article by Robert Laganière, author of OpenCV 2 Computer Vision Application Programming Cookbook, we took a look at image processing using morphological filters. In this article we will see how to segment images using watersheds and GrabCut algorithm. (For more resources related to this subject, see here.) Segmenting images using watersheds The watershed transformation is a popular image processing algorithm that is used to quickly segment an image into homogenous regions. It relies on the idea that when the image is seen as a topological relief, homogeneous regions correspond to relatively flat basins delimitated by steep edges. As a result of its simplicity, the original version of this algorithm tends to over-segment the image which produces multiple small regions. This is why OpenCV proposes a variant of this algorithm that uses a set of predefined markers which guide the definition of the image segments. How to do it... The watershed segmentation is obtained through the use of the cv::watershed function. The input to this function is a 32-bit signed integer marker image in which each non-zero pixel represents a label. The idea is to mark some pixels of the image that are known to certainly belong to a given region. From this initial labeling, the watershed algorithm will determine the regions to which the other pixels belong. In this recipe, we will first create the marker image as a gray-level image, and then convert it into an image of integers. We conveniently encapsulated this step into a WatershedSegmenter class: class WatershedSegmenter { private: cv::Mat markers;public: void setMarkers(const cv::Mat& markerImage) { // Convert to image of ints markerImage.convertTo(markers,CV_32S); } cv::Mat process(const cv::Mat &image) { // Apply watershed cv::watershed(image,markers); return markers; } The way these markers are obtained depends on the application. For example, some preprocessing steps might have resulted in the identification of some pixels belonging to an object of interest. The watershed would then be used to delimitate the complete object from that initial detection. In this recipe, we will simply use the binary image used in the previous article (OpenCV: Image Processing using Morphological Filters) in order to identify the animals of the corresponding original image. Therefore, from our binary image, we need to identify pixels that certainly belong to the foreground (the animals) and pixels that certainly belong to the background (mainly the grass). Here, we will mark foreground pixels with label 255 and background pixels with label 128 (this choice is totally arbitrary, any label number other than 255 would work). The other pixels, that is the ones for which the labeling is unknown, are assigned value 0. As it is now, the binary image includes too many white pixels belonging to various parts of the image. We will then severely erode this image in order to retain only pixels belonging to the important objects: // Eliminate noise and smaller objectscv::Mat fg;cv::erode(binary,fg,cv::Mat(),cv::Point(-1,-1),6); The result is the following image: Note that a few pixels belonging to the background forest are still present. Let's simply keep them. Therefore, they will be considered to correspond to an object of interest. Similarly, we also select a few pixels of the background by a large dilation of the original binary image: // Identify image pixels without objectscv::Mat bg;cv::dilate(binary,bg,cv::Mat(),cv::Point(-1,-1),6);cv::threshold(bg,bg,1,128,cv::THRESH_BINARY_INV); The resulting black pixels correspond to background pixels. This is why the thresholding operation immediately after the dilation assigns to these pixels the value 128. The following image is then obtained: These images are combined to form the marker image: // Create markers imagecv::Mat markers(binary.size(),CV_8U,cv::Scalar(0));markers= fg+bg; Note how we used the overloaded operator+ here in order to combine the images. This is the image that will be used as input to the watershed algorithm: The segmentation is then obtained as follows: // Create watershed segmentation objectWatershedSegmenter segmenter;// Set markers and processsegmenter.setMarkers(markers);segmenter.process(image); The marker image is then updated such that each zero pixel is assigned one of the input labels, while the pixels belonging to the found boundaries have value -1. The resulting image of labels is then: The boundary image is: How it works... As we did in the preceding recipe, we will use the topological map analogy in the description of the watershed algorithm. In order to create a watershed segmentation, the idea is to progressively flood the image starting at level 0. As the level of "water" progressively increases (to levels 1, 2, 3, and so on), catchment basins are formed. The size of these basins also gradually increase and, consequently, the water of two different basins will eventually merge. When this happens, a watershed is created in order to keep the two basins separated. Once the level of water has reached its maximal level, the sets of these created basins and watersheds form the watershed segmentation. As one can expect, the flooding process initially creates many small individual basins. When all of these are merged, many watershed lines are created which results in an over-segmented image. To overcome this problem, a modification to this algorithm has been proposed in which the flooding process starts from a predefined set of marked pixels. The basins created from these markers are labeled in accordance with the values assigned to the initial marks. When two basins having the same label merge, no watersheds are created, thus preventing the oversegmentation. This is what happens when the cv::watershed function is called. The input marker image is updated to produce the final watershed segmentation. Users can input a marker image with any number of labels with pixels of unknown labeling left to value 0. The marker image has been chosen to be an image of a 32-bit signed integer in order to be able to define more than 255 labels. It also allows the special value -1, to be assigned to pixels associated with a watershed. This is what is returned by the cv::watershed function. To facilitate the displaying of the result, we have introduced two special methods. The first one returns an image of the labels (with watersheds at value 0). This is easily done through thresholding: // Return result in the form of an imagecv::Mat getSegmentation() { cv::Mat tmp; // all segment with label higher than 255 // will be assigned value 255 markers.convertTo(tmp,CV_8U); return tmp;} Similarly, the second method returns an image in which the watershed lines are assigned value 0, and the rest of the image is at 255. This time, the cv::convertTo method is used to achieve this result: // Return watershed in the form of an imagecv::Mat getWatersheds() { cv::Mat tmp; // Each pixel p is transformed into // 255p+255 before conversion markers.convertTo(tmp,CV_8U,255,255); return tmp;} The linear transformation that is applied before the conversion allows -1 pixels to be converted into 0 (since -1*255+255=0). Pixels with a value greater than 255 are assigned the value 255. This is due to the saturation operation that is applied when signed integers are converted into unsigned chars. See also The article The viscous watershed transform by C. Vachier, F. Meyer, Journal of Mathematical Imaging and Vision, volume 22, issue 2-3, May 2005, for more information on the watershed transform. The next recipe which presents another image segmentation algorithm that can also segment an image into background and foreground objects.

0
0
6135

Packt

13 Jun 2011

8 min read

Getting Started with OpenSSO

Packt

13 Jun 2011

8 min read

OpenAM Written and tested with OpenAM Snapshot 9—the Single Sign-On (SSO) tool for securing your web applications in a fast and easy way History of OpenSSO Back in early 2000, Sun Microsystems Inc. started the Directory Server Access Management Edition project to develop a product that would solve the web Single Sign-On (SSO) problems. The initial scope was to just provide authentication and authorization as part of the SSO using a proprietary protocol. Over the years it evolved to become a lightweight pure Java application providing a comprehensive set of security features to protect the enterprise resources. After undergoing a lot of changes in terms of product name, features, among other things, it ended up with OpenSSO. As part of Sun Microsystems software business strategy, they have open sourced most of their commercial enterprise software products including Sun Java Enterprise System (JES), Java SDK, and Communication suite of products. OpenSSO is the term coined by the Sun's access management product team as part of their strategy to open source Sun Java System Access Manager, Sun Java System SAML v2 Plugin for Federation Services, and Sun Java System Federation Manager. OpenSSO is an integrated product that includes the features of both Access and Federation manager products. The goal was to amalgamate the access, federation, and web services features. This goal was achieved when Sun released the OpenSSO Enterprise 8.0 that received strong market traction and analysts coverage. Sun's OpenSSO product made it to the leadership quadrant in 2008 in the Access Management category which was published by the Gartner Analyst group. As part of the open source initiative, a lot of code re-factoring and feature alignments occurred that changed the product's outlook. It removed all the native dependencies that were required to build and deploy earlier. OpenSSO became the pure Java web application that enabled the customers and open source community to take advantage of the automatic deployment by just dropping the web archive on any Java servlet container to deploy it. The build and check-in processes were highly simplified which attracted the open source community to contribute to the code and quality of the product. OpenSSO had a very flexible release model wherein new features or bug fixes could easily be implemented in the customer environment by picking up the nightly, express, or enterprise build. OpenSSO Enterprise 8.0 was a major release by Sun that was built from the open source branch. After this release, there were two other express releases. Those were very feature-rich and introduced Secure Token Service (STS) and OAuth functionality. Express build 9 was not released in the binary form by Oracle but the source code has been made available to the open source community. You can download the OpenAM express build, built using the express build 9 branch from the Forgerock site. As part of the acquisition of Sun Microsystems Inc. by Oracle Corporation that happened back in early 2010, the release and support models have been changed for OpenSSO. If you are interested in obtaining a support contract for the enterprise version of the product, you should call up the Oracle support team or the product management team. Oracle continues its support for the OpenSSO enterprise's existing customers. For the OpenSSO open source version (also known as OpenAM) you can approach the Forgerock team to obtain support. OpenSSO vs. OpenAM OpenSSO was the only open source product in the access management segment that had production level quality. Over eight thousands test cases were executed on twelve different Java servlet containers. OpenSSO is supported by a vibrant community that includes engineers, architects, and solution developers. If you have any questions, just send a mail to users@opensso.dev.java.net, and you are likely get the answer to what you want to know. Recently Forgerock (http://www.forgerock.com) undertook an initiative to keep the community and product strong. They periodically fix the bugs in the open source branch. Their version of OpenSSO is called OpenAM, but the code base is the same as OpenSSO. There may be incompatibilities in future if OpenAM code base deviates a lot from the OpenSSO express build 9 code base. Note that the Oracle Open SSO Enterprise 8.0 update releases are based on the OpenSSO Enterprise release 8.0 code base, whereas the open source version OpenAM is continuing its development from the express build 9 code base. OpenSSO—an overview OpenSSO is a freely available feature-rich access management product; it can be downloaded from http://www.forgerock.com/openam.html. It integrates authentication and authorization services, SSO, and open standards-based federation protocols to provide SSO among disparate business domains, entitlement services, and web services security. Overall, customers will be able to build a comprehensive solution for protecting their network resources by preventing unauthorized access to web services, applications, web content, and securing identity data. OpenSSO offers a complete solution for securing both web applications and web services. You can enforce a comprehensive security policy for web applications and services across the enterprise, rather than relying on developers to come up with ad hoc ways to secure services as they develop them. OpenSSO is a pure Java application that makes it easy to deploy on any operating system platform or container as it supports a broad range of operating systems and servlet containers. OpenSSO services All the services provided by the OpenSSO are exposed over HTTP protocol. The clients access them using appropriate interfaces. OpenSSO exposes a rich library of Application Programming Interfaces (APIs) and Service Provider Interfaces (SPIs) using which, customers can achieve the desired functionality. These services developed for OpenSSO generally contain both a server component and a client component. The server component is a simple Java servlet developed to receive XML requests and return XML responses. The opensso.war web application encompasses all the services and associated configuration items that are required to deliver the OpenSSO functionality. The client component is provided as Java API, and in some cases, C API. This allows remote applications and other OpenSSO services to communicate with and consume the particular functionality. Each core service uses its own framework to retrieve customer and service data and to provide it to other OpenSSO services. The OpenSSO framework integrates all of these service frameworks to form a layer that is accessible to all product components and plugins as shown in the following diagram: There are certain core services that are not covered due to the scope of this article. Just to make you aware of the breadth of features provided by the OpenSSO, in the next few sections, some of the prominent features that are not covered will be briefly introduced. Federation services Typically, in the web access management the Single Sign-On happens in the same company, within the same Domain Name Service (DNS) domain. Most of the time this will work for small companies or in B2C type scenarios, whereas in a B2B scenario use of a DNS domain-based SSO will not work as the cookie will not be forwarded to the other DNS domains. Besides, there are privacy and security concerns to perform SSO across multiple businesses using this approach. So how do we solve these kinds of problems where customers want to seamlessly sign on to services even though the services are provided by a third party? Federation is the solution. So, what is federation? Federation is a process that establishes a standards-based method for sharing and managing identity data and establishing a Single Sign-On across security domains and organizations. It allows an organization to offer a variety of external services to trusted business partners, as well as corporate services to internal departments and divisions. Forming trust relationships across security domains allows an organization to integrate applications offered by different departments or divisions within the enterprise, as well as engage in relationships with co-operating business partners that offer complementary services. Towards the federation or solving SSO across multiple domains, multiple industry standards, such as those developed by the Organization for the Advancement of Structured Information Standards (OASIS) and the Liberty Alliance Project), are supported. OpenSSO provides an open and extensible framework for identity federation and associated web services that resolves the problems of identity-enabling web services, web service discovery and invocation, security, and privacy. Federation services are built on the following standards: Liberty Alliance Project Identity Federation Framework (Liberty ID-FF) 1.1 and 1.2 OASIS Security Assertion Markup Language (SAML) 1.0 and 1.1 OASIS Security Assertion Markup Language (SAML) 2.0 WS-Federation (Passive Requestor Profile) SAML 2.0 is becoming the de facto standard for the federation SAML 2.0 is becoming the de facto standard for the federation SSO as many of the vendors and service providers support SAML 2.0 protocol. For instance Google Apps and Salesforce support SAML 2.0 as their choice of protocol for SSO.

0
0
5781

article-image-learn-computer-vision-applications-open-cv

Packt

07 Jun 2011

16 min read

Learn computer vision applications in Open CV

Packt

07 Jun 2011

16 min read

OpenCV 2 Computer Vision Application Programming Cookbook Over 50 recipes to master this library of programming functions for real-time computer vision Read more about this book OpenCV (Open Source Computer Vision) is an open source library containing more than 500 optimized algorithms for image and video analysis. Since its introduction in 1999, it has been largely adopted as the primary development tool by the community of researchers and developers in computer vision. OpenCV was originally developed at Intel by a team led by Gary Bradski as an initiative to advance research in vision and promote the development of rich, vision-based CPU-intensive applications. In this article by Robert Laganière, author of OpenCV 2 Computer Vision Application Programming Cookbook, we will cover: Calibrating a camera Computing the fundamental matrix of an image pair Matching images using random sample consensus Computing a homography between two images (For more resources related to the article, see here.) Introduction Images are generally produced using a digital camera that captures a scene by projecting light onto an image sensor going through its lens. The fact that an image is formed through the projection of a 3D scene onto a 2D plane imposes the existence of important relations between a scene and its image, and between different images of the same scene. Projective geometry is the tool that is used to describe and characterize, in mathematical terms, the process of image formation. In this article, you will learn some of the fundamental projective relations that exist in multi-view imagery and how these can be used in computer vision programming. But before we start the recipes, let's explore the basic concepts related to scene projection and image formation. Image formation Fundamentally, the process used to produce images has not changed since the beginning of photography. The light coming from an observed scene is captured by a camera through a frontal aperture and the captured light rays hit an image plane (or image sensor) located on the back of the camera. Additionally, a lens is used to concentrate the rays coming from the different scene elements. This process is illustrated by the following figure: Here, do is the distance from the lens to the observed object, di is the distance from the lens to the image plane, and f is the focal length of the lens. These quantities are related by the so-called thin lens equation: In computer vision, this camera model can be simplified in a number of ways. First, we can neglect the effect of the lens by considering a camera with an infinitesimal aperture since, in theory, this does not change the image. Only the central ray is therefore considered. Second, since most of the time we have do>>di, we can assume that the image plane is located at the focal distance. Finally, we can notice from the geometry of the system, that the image on the plane is inverted. We can obtain an identical but upright image by simply positioning the image plane in front of the lens. Obviously, this is not physically feasible, but from a mathematical point of view, this is completely equivalent. This simplified model is often referred to as the pin-hole camera model and it is represented as follows: From this model, and using the law of similar triangles, we can easily derive the basic projective equation: The size (hi) of the image of an object (of height ho) is therefore inversely proportional to its distance (do) from the camera which is naturally true. This relation allows the position of the image of a 3D scene point to be predicted onto the image plane of a camera. Calibrating a camera From the introduction of this article, we learned that the essential parameters of a camera under the pin-hole model are its focal length and the size of the image plane (which defines the field of view of the camera). Also, since we are dealing with digital images, the number of pixels on the image plane is another important characteristic of a camera. Finally, in order to be able to compute the position of an image's scene point in pixel coordinates, we need one additional piece of information. Considering the line coming from the focal point that is orthogonal to the image plane, we need to know at which pixel position this line pierces the image plane. This point is called the principal point. It could be logical to assume that this principal point is at the center of the image plane, but in practice, this one might be off by few pixels depending at which precision the camera has been manufactured. Camera calibration is the process by which the different camera parameters are obtained. One can obviously use the specifications provided by the camera manufacturer, but for some tasks, such as 3D reconstruction, these specifications are not accurate enough. Camera calibration will proceed by showing known patterns to the camera and analyzing the obtained images. An optimization process will then determine the optimal parameter values that explain the observations. This is a complex process but made easy by the availability of OpenCV calibration functions. How to do it... To calibrate a camera, the idea is show to this camera a set of scene points for which their 3D position is known. You must then determine where on the image these points project. Obviously, for accurate results, we need to observe several of these points. One way to achieve this would be to take one picture of a scene with many known 3D points. A more convenient way would be to take several images from different viewpoints of a set of some 3D points. This approach is simpler but requires computing the position of each camera view, in addition to the computation of the internal camera parameters which fortunately is feasible. OpenCV proposes to use a chessboard pattern to generate the set of 3D scene points required for calibration. This pattern creates points at the corners of each square, and since this pattern is flat, we can freely assume that the board is located at Z=0 with the X and Y axes well aligned with the grid. In this case, the calibration process simply consists of showing the chessboard pattern to the camera from different viewpoints. Here is one example of a calibration pattern image: The nice thing is that OpenCV has a function that automatically detects the corners of this chessboard pattern. You simply provide an image and the size of the chessboard used (number of vertical and horizontal inner corner points). The function will return the position of these chessboard corners on the image. If the function fails to find the pattern, then it simply returns false: // output vectors of image pointsstd::vector<cv::Point2f> imageCorners;// number of corners on the chessboardcv::Size boardSize(6,4);// Get the chessboard cornersbool found = cv::findChessboardCorners(image, boardSize, imageCorners); Note that this function accepts additional parameters if one needs to tune the algorithm, which are not discussed here. There is also a function that draws the detected corners on the chessboard image with lines connecting them in sequence: //Draw the cornerscv::drawChessboardCorners(image, boardSize, imageCorners, found); // corners have been found The image obtained is seen here: The lines connecting the points shows the order in which the points are listed in the vector of detected points. Now to calibrate the camera, we need to input a set of such image points together with the coordinate of the corresponding 3D points. Let's encapsulate the calibration process in a CameraCalibrator class: class CameraCalibrator { // input points: // the points in world coordinates std::vector<std::vector<cv::Point3f>> objectPoints; // the point positions in pixels std::vector<std::vector<cv::Point2f>> imagePoints; // output Matrices cv::Mat cameraMatrix; cv::Mat distCoeffs; // flag to specify how calibration is done int flag; // used in image undistortion cv::Mat map1,map2; bool mustInitUndistort; public: CameraCalibrator() : flag(0), mustInitUndistort(true) {}; As mentioned previously, the 3D coordinates of the points on the chessboard pattern can be easily determined if we conveniently place the reference frame on the board. The method that accomplishes this takes a vector of the chessboard image filename as input: // Open chessboard images and extract corner pointsint CameraCalibrator::addChessboardPoints( const std::vector<std::string>& filelist, cv::Size & boardSize) { // the points on the chessboard std::vector<cv::Point2f> imageCorners; std::vector<cv::Point3f> objectCorners; // 3D Scene Points: // Initialize the chessboard corners // in the chessboard reference frame // The corners are at 3D location (X,Y,Z)= (i,j,0) for (int i=0; i<boardSize.height; i++) { for (int j=0; j<boardSize.width; j++) { objectCorners.push_back(cv::Point3f(i, j, 0.0f)); } } // 2D Image points: cv::Mat image; // to contain chessboard image int successes = 0; // for all viewpoints for (int i=0; i<filelist.size(); i++) { // Open the image image = cv::imread(filelist[i],0); // Get the chessboard corners bool found = cv::findChessboardCorners( image, boardSize, imageCorners); // Get subpixel accuracy on the corners cv::cornerSubPix(image, imageCorners, cv::Size(5,5), cv::Size(-1,-1), cv::TermCriteria(cv::TermCriteria::MAX_ITER + cv::TermCriteria::EPS, 30, // max number of iterations 0.1)); // min accuracy //If we have a good board, add it to our data if (imageCorners.size() == boardSize.area()) { // Add image and scene points from one view addPoints(imageCorners, objectCorners); successes++; } } return successes;} The first loop inputs the 3D coordinates of the chessboard, which are specified in an arbitrary square size unit here. The corresponding image points are the ones provided by the cv::findChessboardCorners function. This is done for all available viewpoints. Moreover, in order to obtain a more accurate image point location, the function cv::cornerSubPix can be used and as the name suggests, the image points will then be localized at sub-pixel accuracy. The termination criterion that is specified by the cv::TermCriteria object defines a maximum number of iterations and a minimum accuracy in sub-pixel coordinates. The first of these two conditions that is reached will stop the corner refinement process. When a set of chessboard corners has been successfully detected, these points are added to our vector of image and scene points: // Add scene points and corresponding image pointsvoid CameraCalibrator::addPoints(const std::vector<cv::Point2f>&imageCorners, const std::vector<cv::Point3f>& objectCorners) { // 2D image points from one view imagePoints.push_back(imageCorners); // corresponding 3D scene points objectPoints.push_back(objectCorners);} The vectors contains std::vector instances. Indeed, each vector element being a vector of points from one view. Once a sufficient number of chessboard images have been processed (and consequently a large number of 3D scene point/2D image point correspondences are available), we can initiate the computation of the calibration parameters: // Calibrate the camera// returns the re-projection errordouble CameraCalibrator::calibrate(cv::Size &imageSize){ // undistorter must be reinitialized mustInitUndistort= true; //Output rotations and translations std::vector<cv::Mat> rvecs, tvecs; // start calibration return calibrateCamera(objectPoints, // the 3D points imagePoints, // the image points imageSize, // image size cameraMatrix,// output camera matrix distCoeffs, // output distortion matrix rvecs, tvecs,// Rs, Ts flag); // set options} In practice, 10 to 20 chessboard images are sufficient, but these must be taken from different viewpoints at different depths. The two important outputs of this function are the camera matrix and the distortion parameters. The camera matrix will be described in the next section. For now, let's consider the distortion parameters. So far, we have mentioned that with the pin-hole camera model, we can neglect the effect of the lens. But this is only possible if the lens used to capture an image does not introduce too important optical distortions. Unfortunately, this is often the case with lenses of lower quality or with lenses having a very short focal length. You may have already noticed that in the image we used for our example, the chessboard pattern shown is clearly distorted. The edges of the rectangular board being curved in the image. It can also be noticed that this distortion becomes more important as we move far from the center of the image. This is a typical distortion observed with fish-eye lens and it is called radial distortion. The lenses that are used in common digital cameras do not exhibit such a high degree of distortion, but in the case of the lens used here, these distortions cannot certainly be ignored. It is possible to compensate for these deformations by introducing an appropriate model. The idea is to represent the distortions induced by a lens by a set of mathematical equations. Once established, these equations can then be reverted in order to undo the distortions visible on the image. Fortunately, the exact parameters of the transformation that will correct the distortions can be obtained together with the other camera parameter during the calibration phase. Once this is done, any image from the newly calibrated camera can be undistorted: // remove distortion in an image (after calibration)cv::Mat CameraCalibrator::remap(const cv::Mat &image) { cv::Mat undistorted; if (mustInitUndistort) { // called once per calibration cv::initUndistortRectifyMap( cameraMatrix, // computed camera matrix distCoeffs, // computed distortion matrix cv::Mat(), // optional rectification (none) cv::Mat(), // camera matrix to generate undistorted image.size(), // size of undistorted CV_32FC1, // type of output map map1, map2); // the x and y mapping functions mustInitUndistort= false; } // Apply mapping functions cv::remap(image, undistorted, map1, map2, cv::INTER_LINEAR); // interpolation type return undistorted;} Which results in the following image: As you can see, once the image is undistorted, we obtain a regular perspective image. How it works... In order to explain the result of the calibration, we need to go back to the figure in the introduction which describes the pin-hole camera model. More specifically, we want to demonstrate the relation between a point in 3D at position (X,Y,Z) and its image (x,y) on a camera specified in pixel coordinates. Let's redraw this figure by adding a reference frame that we position at the center of the projection as seen here: Note that the Y-axis is pointing downward to get a coordinate system compatible with the usual convention that places the image origin at the upper-left corner. We learned previously that the point (X,Y,Z) will be projected onto the image plane at (fX/Z,fY/Z). Now, if we want to translate this coordinate into pixels, we need to divide the 2D image position by, respectively, the pixel width (px) and height (py). We notice that by dividing the focal length f given in world units (most often meters or millimeters) by px, then we obtain the focal length expressed in (horizontal) pixels. Let's then define this term as fx. Similarly, fy =f/py is defined as the focal length expressed in vertical pixel unit. The complete projective equation is therefore: Recall that (u0,v0) is the principal point that is added to the result in order to move the origin to the upper-left corner of the image. These equations can be rewritten in matrix form through the introduction of homogeneous coordinates in which 2D points are represented by 3-vectors, and 3D points represented by 4-vectors (the extra coordinate is simply an arbitrary scale factor that need to be removed when a 2D coordinate needs to be extracted from a homogeneous 3-vector). Here is the projective equation rewritten: The second matrix is a simple projection matrix. The first matrix includes all of the camera parameters which are called the intrinsic parameters of the camera. This 3x3 matrix is one of the output matrices returned by the cv::calibrateCamera function. There is also a function called cv::calibrationMatrixValues that returns the value of the intrinsic parameters given a calibration matrix. More generally, when the reference frame is not at the projection center of the camera, we will need to add a rotation (a 3x3 matrix) and a translation vector (3x1 matrix). These two matrices describe the rigid transformation that must be applied to the 3D points in order to bring them back to the camera reference frame. Therefore, we can rewrite the projection equation in its most general form: Remember that in our calibration example, the reference frame was placed on the chessboard. Therefore, there is a rigid transformation (rotation and translation) that must be computed for each view. These are in the output parameter list of the cv::calibrateCamera function. The rotation and translation components are often called the extrinsic parameters of the calibration and they are different for each view. The intrinsic parameters remain constant for a given camera/lens system. The intrinsic parameters of our test camera obtained from a calibration based on 20 chessboard images are fx=167, fy=178, u0=156, v0=119. These results are obtained by cv::calibrateCamera through an optimization process aimed at finding the intrinsic and extrinsic parameters that will minimize the difference between the predicted image point position, as computed from the projection of the 3D scene points, and the actual image point position, as observed on the image. The sum of this difference for all points specified during the calibration is called the re-projection error. To correct the distortion, OpenCV uses a polynomial function that is applied to the image point in order to move them at their undistorted position. By default, 5 coefficients are used; a model made of 8 coefficients is also available. Once these coefficients are obtained, it is possible to compute 2 mapping functions (one for the x coordinate and one for the y) that will give the new undistorted position of an image point on a distorted image. This is computed by the function cv::initUndistortRectifyMap and the function cv::remap remaps all of the points of an input image to a new image. Note that because of the non-linear transformation, some pixels of the input image now fall outside the boundary of the output image. You can expand the size of the output image to compensate for this loss of pixels, but you will now obtain output pixels that have no values in the input image (they will then be displayed as black pixels). There's more... When a good estimate of the camera intrinsic parameters are known, it could be advantageous to input them to the cv::calibrateCamera function. They will then be used as initial values in the optimization process. To do so, you just need to add the flag CV_CALIB_USE_INTRINSIC_GUESS and input these values in the calibration matrix parameter. It is also possible to impose a fixed value for the principal point (CV_CALIB_ FIX_PRINCIPAL_POINT), which can often be assumed to be the central pixel. You can also impose a fixed ratio for the focal lengths fx and fy (CV_CALIB_FIX_RATIO) in which case you assume pixels of square shape.

0
1
7175

How-To Tutorials - Data

Getting Started with Oracle Information Integration

Oracle Tools and Products

Integrating Kettle and the Pentaho Suite

Excel 2010 Financials: Identifying the Profitability of an Investment

Getting Started with Oracle GoldenGate

Building Financial Functions into Excel 2010

Designing User Security for Oracle PeopleSoft Applications

An Overview of Oracle PeopleSoft Commitment Control

Excel 2010 Financials: Using Graphs for Analysis

Excel 2010 Financials: Adding Animations to Excel Graphs

Trending Topics

Pentaho Data Integration 4: working with complex data flows

Pentaho Data Integration 4: Understanding Data Flows

Segmenting images in OpenCV

Getting Started with OpenSSO

Learn computer vision applications in Open CV

Create a Free Account To Continue Reading

Sign in to activate your 7-day free access