Home Cloud & Networking Oracle 11g R1 / R2 Real Application Clusters Handbook

Oracle 11g R1 / R2 Real Application Clusters Handbook

books-svg-icon Book
eBook $61.99 $42.99
Subscription $15.99 $10 p/m for three months
$10 p/m for first 3 months. $15.99 p/m after that. Cancel Anytime!
What do you get with a Packt Subscription?
This book & 7000+ ebooks & video courses on 1000+ technologies
60+ curated reading lists for various learning paths
50+ new titles added every month on new and emerging tech
Early Access to eBooks as they are being written
Personalised content suggestions
Customised display settings for better reading experience
50+ new titles added every month on new and emerging tech
Playlists, Notes and Bookmarks to easily manage your learning
Mobile App with offline access
What do you get with a Packt Subscription?
This book & 6500+ ebooks & video courses on 1000+ technologies
60+ curated reading lists for various learning paths
50+ new titles added every month on new and emerging tech
Early Access to eBooks as they are being written
Personalised content suggestions
Customised display settings for better reading experience
50+ new titles added every month on new and emerging tech
Playlists, Notes and Bookmarks to easily manage your learning
Mobile App with offline access
What do you get with eBook + Subscription?
Download this book in EPUB and PDF formats, plus a monthly download credit
This book & 6500+ ebooks & video courses on 1000+ technologies
60+ curated reading lists for various learning paths
50+ new titles added every month on new and emerging tech
Early Access to eBooks as they are being written
Personalised content suggestions
Customised display settings for better reading experience
50+ new titles added every month on new and emerging tech
Playlists, Notes and Bookmarks to easily manage your learning
Mobile App with offline access
What do you get with a Packt Subscription?
This book & 6500+ ebooks & video courses on 1000+ technologies
60+ curated reading lists for various learning paths
50+ new titles added every month on new and emerging tech
Early Access to eBooks as they are being written
Personalised content suggestions
Customised display settings for better reading experience
50+ new titles added every month on new and emerging tech
Playlists, Notes and Bookmarks to easily manage your learning
Mobile App with offline access
What do you get with eBook?
Download this book in EPUB and PDF formats
Access this title in our online reader
DRM FREE - Read whenever, wherever and however you want
Online reader with customised display settings for better reading experience
What do you get with video?
Download this video in MP4 format
Access this title in our online reader
DRM FREE - Watch whenever, wherever and however you want
Online reader with customised display settings for better learning experience
What do you get with video?
Stream this video
Access this title in our online reader
DRM FREE - Watch whenever, wherever and however you want
Online reader with customised display settings for better learning experience
What do you get with Audiobook?
Download a zip folder consisting of audio files (in MP3 Format) along with supplementary PDF
What do you get with Exam Trainer?
Flashcards, Mock exams, Exam Tips, Practice Questions
Access these resources with our interactive certification platform
Mobile compatible-Practice whenever, wherever, however you want
BUY NOW $10 p/m for first 3 months. $15.99 p/m after that. Cancel Anytime!
eBook $61.99 $42.99
Subscription $15.99 $10 p/m for three months
What do you get with a Packt Subscription?
This book & 7000+ ebooks & video courses on 1000+ technologies
60+ curated reading lists for various learning paths
50+ new titles added every month on new and emerging tech
Early Access to eBooks as they are being written
Personalised content suggestions
Customised display settings for better reading experience
50+ new titles added every month on new and emerging tech
Playlists, Notes and Bookmarks to easily manage your learning
Mobile App with offline access
What do you get with a Packt Subscription?
This book & 6500+ ebooks & video courses on 1000+ technologies
60+ curated reading lists for various learning paths
50+ new titles added every month on new and emerging tech
Early Access to eBooks as they are being written
Personalised content suggestions
Customised display settings for better reading experience
50+ new titles added every month on new and emerging tech
Playlists, Notes and Bookmarks to easily manage your learning
Mobile App with offline access
What do you get with eBook + Subscription?
Download this book in EPUB and PDF formats, plus a monthly download credit
This book & 6500+ ebooks & video courses on 1000+ technologies
60+ curated reading lists for various learning paths
50+ new titles added every month on new and emerging tech
Early Access to eBooks as they are being written
Personalised content suggestions
Customised display settings for better reading experience
50+ new titles added every month on new and emerging tech
Playlists, Notes and Bookmarks to easily manage your learning
Mobile App with offline access
What do you get with a Packt Subscription?
This book & 6500+ ebooks & video courses on 1000+ technologies
60+ curated reading lists for various learning paths
50+ new titles added every month on new and emerging tech
Early Access to eBooks as they are being written
Personalised content suggestions
Customised display settings for better reading experience
50+ new titles added every month on new and emerging tech
Playlists, Notes and Bookmarks to easily manage your learning
Mobile App with offline access
What do you get with eBook?
Download this book in EPUB and PDF formats
Access this title in our online reader
DRM FREE - Read whenever, wherever and however you want
Online reader with customised display settings for better reading experience
What do you get with video?
Download this video in MP4 format
Access this title in our online reader
DRM FREE - Watch whenever, wherever and however you want
Online reader with customised display settings for better learning experience
What do you get with video?
Stream this video
Access this title in our online reader
DRM FREE - Watch whenever, wherever and however you want
Online reader with customised display settings for better learning experience
What do you get with Audiobook?
Download a zip folder consisting of audio files (in MP3 Format) along with supplementary PDF
What do you get with Exam Trainer?
Flashcards, Mock exams, Exam Tips, Practice Questions
Access these resources with our interactive certification platform
Mobile compatible-Practice whenever, wherever, however you want
  1. Free Chapter
    High Availability
About this book
RAC or Real Application Clusters is a grid computing solution that allows multiple nodes (servers) in a clustered system to mount and open a single database that resides on shared disk storage. Should a single system (node) fail, the database service will still be available on the remaining nodes. RAC is an integral part of the Oracle database setup: one database, multiple users accessing it, in real time. This book will enable DBAs to get their finger on the pulse of the Oracle 11g RAC environment quickly and easily. This practical handbook documents how to administer a complex Oracle 11g RAC environment. It covers all areas of the Oracle 11g R1 RAC environment, with bonus R2 information included, and is indispensable if you are an Oracle DBA charged with configuring and implementing Oracle11g. It presents a complete method for the design, installation, and configuration of Oracle 11g RAC, ultimately enabling rapid administration of Oracle 11g RAC environments.Packed with real-world examples, expert tips, and troubleshooting advice, the book begins by introducing the concept of RAC and High Availability. It then dives deep into the world of RAC design, installation, and configuration, enabling you to support complex RAC environments for real-world deployments. Chapters cover RAC and High Availability, Oracle 11g RAC Architecture, Oracle 11g RAC Installation, Automatic Storage Management, Troubleshooting, Workload Management, and much more. By following the practical examples in the book, you will learn every concept of the RAC environment and how to successfully support complex Oracle 11g R1 and R2 RAC environments for various deployments in real-world situations.
Publication date:
July 2010
Publisher
Packt
Pages
676
ISBN
9781847199621

 

Chapter 1. High Availability

High availability is a discipline within database technology that provides a solution to protect against data loss and against downtime, which is costly to mission-critical database systems. In this chapter, we will discuss how Oracle 11g RAC provides you with mission-critical options for minimizing outages and downtime as well as how RAC fits into the overall scheme for maintenance of a comprehensive disaster recovery and business continuity policy. In this chapter, we will provide you with an introduction to the high availability concepts and solutions that are workable for Oracle 11g. As such, we will provide details on what constitutes high availability and what does not. By having the proper framework, you will understand how to leverage Oracle RAC and auxiliary technologies including Oracle Data Guard to maximize the Return On Investment (ROI) for your data center environment. In summary, we will discuss the following topics:

  • High Availability concepts

  • Fault Tolerant Systems and High Availability

  • High Availability solutions for Oracle 11g R1 and 11g R2 Real Application Clusters (RAC)

High availability concepts

High Availability provides data center environments that run mission-critical database applications with the resiliency to withstand failures that may occur due to natural, human, or environmental conditions. For example, if a hurricane wipes out the production data center that hosts a financial application's production database, high availability would provide the much needed protection to avoid data loss, minimize downtime, and maximize availability of the firm's resources and database applications. Let's now move to the High Availability concepts.

Planned versus unplanned downtime

The distinction needs to be made between planned downtime and unplanned downtime. In most cases, planned downtime is the result of maintenance that is disruptive to system operations and cannot be avoided with current system designs for a data center. An example of planned downtime would be a DBA maintenance activity such as database patching to an Oracle database that would require taking an outage to take the system offline for a period of time. From the database administrator's perspective, planned downtime situations usually are the result of management-initiated events.

On the other hand, unplanned downtime issues frequently occur due to a physical event caused by a hardware, software, or environmental failure or caused by human error. A few examples of unplanned downtime events include hardware server component failures such as CPU, disk, or power outages.

Most data centers will exclude planned downtime from the high availability factor in terms of calculating the current total availability percentage. Even so, both planned and unplanned maintenance windows affect high availability. For instance, database upgrades require a few hours of downtime. Another example would be a SAN replacement. Such items make comprehensive four nine solutions nigh impossible to implement without additional considerations. The fact is that implementing a true 100% high availability is nearly impossible without exorbitant costs. To have complete high availability for all components within the data center requires an architecture for all systems and databases that eliminates any Single Point of Failure (SPOF) and allows for total online availability for all server hardware, network, operating systems, applications, and database systems.

Service Level Agreements for high availability

When it comes to determining high availability ratios, this is often expressed as the percentage of uptime in a given year. The following table shows the approximate downtime that is allowed for a specific percentage of high availability, granted that the system is required to operate continuously. Service Level Agreements (SLAs) usually refer to monthly downtime or availability in order to calculate service levels to match monthly financial cycles. The following table from the International Organization for Standardization (ISO) illustrates the correlation from a given availability percentage to the relevant amount of time a system would be unavailable per year, month, or week:

Availability %

Annual downtime

Monthly downtime*

Weekly downtime

90%

36.5 days

72 hours

16.8 hours

95%

18.25 days

36 hours

8.4 hours

98%

7.30 days

14.4 hours

3.36 hours

99%

3.65 days

7.20 hours

1.68 hours

99.5%

1.83 days

3.60 hours

50.4 minutes

99.8%

17.52 hours

86.23 minutes

20.16 minutes

99.9% ("three nines")

8.76 hours

43.2 minutes

10.1 minutes

99.95%

4.38 hours

21.56 minutes

5.04 minutes

99.99% ("four nines")

52.6 minutes

4.32 minutes

1.01 minutes

99.999% ("five nines")

5.26 minutes

25.9 seconds

6.05 seconds

99.9999% ("six nines")

31.5 seconds

2.59 seconds

0.605 seconds

Note

For monthly calculations, a 30-day month is used.

It should be noted that availability and uptimes are not the same thing. For instance, a database system may be online but not available, as in the case of application outages such as when a user's SQL script cannot be executed.

In most cases, the number of nines is not often used by the database or system professional when measuring high availability for data center environments because it is difficult to extrapolate such hard numbers without a large test environment. For practical purposes, availability is calculated more as a probability or average downtime given per annual basis.

High availability interpretations

When it comes to discussing how availability is measured, there is a debate on the correct method of interpretation for high availability ratios. For instance, an Oracle database server that has been online for 365 days in a given non-leap year might have been eclipsed by an application failure that lasted for nine hours during a peak usage period. As a consequence, the users will see the complete system as unavailable, whereas the Oracle database administrator will claim 100% "uptime." However, given the true definition for availability, the Oracle database will be approximately 99.897% available (8751 hours of available timeout of 8760 hours per non-leap year). Furthermore, Oracle database systems experiencing performance problems are often deemed partially or entirely unavailable by users, while in the eyes of the database administrator the system is fine and available.

Another situation that presents a challenge in terms of what constitutes availability would be the scenario in which the availability of a mission-critical application might go offline yet is not viewed as unavailable by the Oracle DBA as the database instance could still be online and thus available. However, the application in question is offline to the end user thus presenting a status of unavailable from the perspective of the end user. This illustrates the key point that a true availability measure must be a holistic perspective and not strictly from the database's point of view.

Availability should be measured with comprehensive monitoring tools that are themselves highly available and present the proper instrumentation. If there is a lack of instrumentation, systems supporting high volume transaction processing frequently during day and night such as credit card processing database servers are often inherently better monitored than systems that experience a periodic lull in demand. Currently, custom scripts can be developed in conjunction with third-party tools to provide a measure of availability. One such tool that we recommend for monitoring database, server, and application availability is that provided by Oracle Grid Control, which also includes Oracle Enterprise Manager.

Oracle Grid Control provides instrumentation via agents and plugin modules to measure availability and performance on a system-wide enterprise level, thereby greatly aiding the Oracle database professional to measure, track, and report to management and users on the status for availability with all mission-critical applications and system components. However, the current version of Oracle Enterprise Manager will not provide a true picture of availability until 11g Grid Control is released in the future.

Recovery time and High Availability

Recovery time is closely related to the concept of high availability. Recovery time varies based on system design and failure experienced, in that a full recovery may well be impossible if the system design prevents such recovery options. For example, if the data center is not designed correctly with the required system and database backups and a standby disaster recovery site in place, then a major catastrophe such as a fire or earthquake will almost always result in complete unavailability until a complete MAA solution is implemented. In this case, only a partial recovery may be possible. This drives home the point that for all major data center operations, you should always have a backup plan with an offsite secondary disaster recovery data center to protect against losing all critical systems and data.

In terms of database administration for Oracle data centers, the concept of data availability is essential when dealing with recovery time and planning for highly available options. Data availability references the degree to which databases such as Oracle record and report transactions. Data management professionals often focus just on data availability in order to judge what constitutes an acceptable data loss with different types of failure events. While application service interruptions are inconvenient and sometimes permitted, data loss is not to be tolerated. As one Chief Information Officer (CIO) and executive once told us while working for a large financial brokerage, you can have the system down to perform maintenance but never ever lose my data!

The next item related to high availability and recovery standards is that of SLevel Agreements or SLAs for data center operations. The purpose of the Service Level Agreement is to actualize the availability objectives and requirements for a data center environment per business requirements into a standard corporate information technology (IT) policy.

System design for high availability

Ironically, by adding further components to the overall system and database architecture design, you may actually undermine your efforts to achieve true high availability for your Oracle data center environment. The reason for this is by their very nature, complex systems inherently have more potential failure points and thus are more difficult to implement properly. The most highly available systems for Oracle adhere to a simple design pattern that makes use of a single, high quality, multipurpose physical system with comprehensive internal redundancy running all interdependent functions, paired with a second like system at a separate physical location. An example would be to have a primary Oracle RAC clustered site with a second Disaster Recovery site at another location with Oracle Data Guard and perhaps dual Oracle RAC clusters at both sites connected by stretch clusters. The best possible way to implement an active standby site with Oracle would be to have Oracle Streams and Oracle Data Guard. Large commercial banking and insurance institutions would benefit from this model for Oracle data center design to maximize system availability.

Business Continuity and high availability

Business Continuity Planning (BCP) refers to the creation and validation of a rehearsed operations plan for the IT organization that explains the procedures of how the data center and business unit will recover and restore, partially or completely, interrupted business functions within a predetermined time after a major disaster. The following diagram illustrates the logical flow of core business continuity life cycle processes:

In its simplest terms, BCP is the foundation for the IT data center operations team to maintain critical systems in the event of disaster. Major incidents could include events such as fires, earthquakes, or national acts of terrorism.

BCP may also encompass corporate training efforts to help reduce operational risk factors associated with the lack of information technology (IT) management controls. These BCP processes may also be integrated with IT standards and practices to improve security and corporate risk management practices. An example would be to implement BCP controls as part of Sarbanes-Oxley (SOX) compliance requirements for publicly traded corporations.

The origins for BCP standards arose from the British Standards Institution (BSI) in 2006 when the BSI released a new independent standard for business continuity called BS 25999-1. Prior to the introduction of this standard for BCP, IT professionals had to rely on the previous BSI information security standard, BS 7799, which provided only limited standards for business continuity compliance procedures. One of the key benefits of these new standards was to extend additional practices for business continuity to a wider variety of organizations, to cover needs for public sector, government, non-profit, and private corporations.

Disaster Recovery

Disaster Recovery (DR) is the process, policies, and procedures related to preparing for recovery or continuation of technology infrastructure critical to an organization after either a natural or human caused disaster.

Disaster Recovery Planning (DRP) is a subset of larger processes such as Business Continuity and should include planning for resumption of applications, databases, hardware, networking, and other IT infrastructure components. A Business Continuity Plan includes planning for non-IT-related aspects such as staff member activities during a major disaster as well as site facility operations, and it should reference the Disaster Recovery Plan for IT-related infrastructure recovery and business continuity procedures and guidelines.

Business Continuity and Disaster Recovery guidelines

The following recommendations will provide you with a blueprint to formulate your requirements and implementation for a robust Business Continuity and Disaster Recovery plan:

  1. 1. Identifying the scope and boundaries of your Business Continuity Plan:

    The first step enables you to define the scope of your new business continuity plan. It provides you with the idea for limitations and boundaries of the business continuity plan. It also includes important audit and risk analysis reports for corporate assets.

  2. 2. Conducting a Business Impact Analysis session:

    Business Impact Analysis (BIA) is the assessment of financial losses to institutions, which usually results as the consequence of destructive events such as the loss or unavailability of mission-critical business services.

  3. 3. Obtaining support for your business continuity plans and goals from the executive management team:

    You will need to convince senior management to approve your business continuity plan so that you can flawlessly execute your disaster recovery planning. Assign stakeholders as representatives on the project planning committee team, once approval is obtained from the corporate executive team.

  4. 4. Understanding its specific role:

    In the possible event of a major disaster, each of your departments must be prepared to take immediate action. In order to successfully recover your mission-critical database systems with minimal loss, each team must understand the BCP and DRP plans as well as follow them correctly. Furthermore, it is also important to maintain your DRP and BCP plans as well as conduct periodic training of your IT staff members on a regular basis to have successful response time for emergencies. Such "smoke tests" to train and keep your IT staff members up-to-date on the correct procedures and communications will pay major dividends in the event of an unforeseen disaster.

One useful tool for creating and managing BCP plans is available from the National Institute of Standards and Technologies (NIST). The NIST documentation can be used to generate templates that can be used as an excellent starting point for your Business Continuity and Disaster Recovery planning. We highly recommend that you download and review the following NIST publication for creating and evaluating BCP plans, Contingency Planning Guide for Information Technology Systems, which is available online at http://csrc.nist.gov/publications/nistpubs/800-34/sp800-34.pdf.

Additional NIST documents may also provide insight into how best to manage new or current BCP or DRP plans. A complete listing of NIST publications is available online at http://csrc.nist.gov/publications/PubsSPs.html.

 

High availability concepts


High Availability provides data center environments that run mission-critical database applications with the resiliency to withstand failures that may occur due to natural, human, or environmental conditions. For example, if a hurricane wipes out the production data center that hosts a financial application's production database, high availability would provide the much needed protection to avoid data loss, minimize downtime, and maximize availability of the firm's resources and database applications. Let's now move to the High Availability concepts.

Planned versus unplanned downtime

The distinction needs to be made between planned downtime and unplanned downtime. In most cases, planned downtime is the result of maintenance that is disruptive to system operations and cannot be avoided with current system designs for a data center. An example of planned downtime would be a DBA maintenance activity such as database patching to an Oracle database that would require taking an outage to take the system offline for a period of time. From the database administrator's perspective, planned downtime situations usually are the result of management-initiated events.

On the other hand, unplanned downtime issues frequently occur due to a physical event caused by a hardware, software, or environmental failure or caused by human error. A few examples of unplanned downtime events include hardware server component failures such as CPU, disk, or power outages.

Most data centers will exclude planned downtime from the high availability factor in terms of calculating the current total availability percentage. Even so, both planned and unplanned maintenance windows affect high availability. For instance, database upgrades require a few hours of downtime. Another example would be a SAN replacement. Such items make comprehensive four nine solutions nigh impossible to implement without additional considerations. The fact is that implementing a true 100% high availability is nearly impossible without exorbitant costs. To have complete high availability for all components within the data center requires an architecture for all systems and databases that eliminates any Single Point of Failure (SPOF) and allows for total online availability for all server hardware, network, operating systems, applications, and database systems.

Service Level Agreements for high availability

When it comes to determining high availability ratios, this is often expressed as the percentage of uptime in a given year. The following table shows the approximate downtime that is allowed for a specific percentage of high availability, granted that the system is required to operate continuously. Service Level Agreements (SLAs) usually refer to monthly downtime or availability in order to calculate service levels to match monthly financial cycles. The following table from the International Organization for Standardization (ISO) illustrates the correlation from a given availability percentage to the relevant amount of time a system would be unavailable per year, month, or week:

Availability %

Annual downtime

Monthly downtime*

Weekly downtime

90%

36.5 days

72 hours

16.8 hours

95%

18.25 days

36 hours

8.4 hours

98%

7.30 days

14.4 hours

3.36 hours

99%

3.65 days

7.20 hours

1.68 hours

99.5%

1.83 days

3.60 hours

50.4 minutes

99.8%

17.52 hours

86.23 minutes

20.16 minutes

99.9% ("three nines")

8.76 hours

43.2 minutes

10.1 minutes

99.95%

4.38 hours

21.56 minutes

5.04 minutes

99.99% ("four nines")

52.6 minutes

4.32 minutes

1.01 minutes

99.999% ("five nines")

5.26 minutes

25.9 seconds

6.05 seconds

99.9999% ("six nines")

31.5 seconds

2.59 seconds

0.605 seconds

Note

For monthly calculations, a 30-day month is used.

It should be noted that availability and uptimes are not the same thing. For instance, a database system may be online but not available, as in the case of application outages such as when a user's SQL script cannot be executed.

In most cases, the number of nines is not often used by the database or system professional when measuring high availability for data center environments because it is difficult to extrapolate such hard numbers without a large test environment. For practical purposes, availability is calculated more as a probability or average downtime given per annual basis.

High availability interpretations

When it comes to discussing how availability is measured, there is a debate on the correct method of interpretation for high availability ratios. For instance, an Oracle database server that has been online for 365 days in a given non-leap year might have been eclipsed by an application failure that lasted for nine hours during a peak usage period. As a consequence, the users will see the complete system as unavailable, whereas the Oracle database administrator will claim 100% "uptime." However, given the true definition for availability, the Oracle database will be approximately 99.897% available (8751 hours of available timeout of 8760 hours per non-leap year). Furthermore, Oracle database systems experiencing performance problems are often deemed partially or entirely unavailable by users, while in the eyes of the database administrator the system is fine and available.

Another situation that presents a challenge in terms of what constitutes availability would be the scenario in which the availability of a mission-critical application might go offline yet is not viewed as unavailable by the Oracle DBA as the database instance could still be online and thus available. However, the application in question is offline to the end user thus presenting a status of unavailable from the perspective of the end user. This illustrates the key point that a true availability measure must be a holistic perspective and not strictly from the database's point of view.

Availability should be measured with comprehensive monitoring tools that are themselves highly available and present the proper instrumentation. If there is a lack of instrumentation, systems supporting high volume transaction processing frequently during day and night such as credit card processing database servers are often inherently better monitored than systems that experience a periodic lull in demand. Currently, custom scripts can be developed in conjunction with third-party tools to provide a measure of availability. One such tool that we recommend for monitoring database, server, and application availability is that provided by Oracle Grid Control, which also includes Oracle Enterprise Manager.

Oracle Grid Control provides instrumentation via agents and plugin modules to measure availability and performance on a system-wide enterprise level, thereby greatly aiding the Oracle database professional to measure, track, and report to management and users on the status for availability with all mission-critical applications and system components. However, the current version of Oracle Enterprise Manager will not provide a true picture of availability until 11g Grid Control is released in the future.

Recovery time and High Availability

Recovery time is closely related to the concept of high availability. Recovery time varies based on system design and failure experienced, in that a full recovery may well be impossible if the system design prevents such recovery options. For example, if the data center is not designed correctly with the required system and database backups and a standby disaster recovery site in place, then a major catastrophe such as a fire or earthquake will almost always result in complete unavailability until a complete MAA solution is implemented. In this case, only a partial recovery may be possible. This drives home the point that for all major data center operations, you should always have a backup plan with an offsite secondary disaster recovery data center to protect against losing all critical systems and data.

In terms of database administration for Oracle data centers, the concept of data availability is essential when dealing with recovery time and planning for highly available options. Data availability references the degree to which databases such as Oracle record and report transactions. Data management professionals often focus just on data availability in order to judge what constitutes an acceptable data loss with different types of failure events. While application service interruptions are inconvenient and sometimes permitted, data loss is not to be tolerated. As one Chief Information Officer (CIO) and executive once told us while working for a large financial brokerage, you can have the system down to perform maintenance but never ever lose my data!

The next item related to high availability and recovery standards is that of SLevel Agreements or SLAs for data center operations. The purpose of the Service Level Agreement is to actualize the availability objectives and requirements for a data center environment per business requirements into a standard corporate information technology (IT) policy.

System design for high availability

Ironically, by adding further components to the overall system and database architecture design, you may actually undermine your efforts to achieve true high availability for your Oracle data center environment. The reason for this is by their very nature, complex systems inherently have more potential failure points and thus are more difficult to implement properly. The most highly available systems for Oracle adhere to a simple design pattern that makes use of a single, high quality, multipurpose physical system with comprehensive internal redundancy running all interdependent functions, paired with a second like system at a separate physical location. An example would be to have a primary Oracle RAC clustered site with a second Disaster Recovery site at another location with Oracle Data Guard and perhaps dual Oracle RAC clusters at both sites connected by stretch clusters. The best possible way to implement an active standby site with Oracle would be to have Oracle Streams and Oracle Data Guard. Large commercial banking and insurance institutions would benefit from this model for Oracle data center design to maximize system availability.

Business Continuity and high availability

Business Continuity Planning (BCP) refers to the creation and validation of a rehearsed operations plan for the IT organization that explains the procedures of how the data center and business unit will recover and restore, partially or completely, interrupted business functions within a predetermined time after a major disaster. The following diagram illustrates the logical flow of core business continuity life cycle processes:

In its simplest terms, BCP is the foundation for the IT data center operations team to maintain critical systems in the event of disaster. Major incidents could include events such as fires, earthquakes, or national acts of terrorism.

BCP may also encompass corporate training efforts to help reduce operational risk factors associated with the lack of information technology (IT) management controls. These BCP processes may also be integrated with IT standards and practices to improve security and corporate risk management practices. An example would be to implement BCP controls as part of Sarbanes-Oxley (SOX) compliance requirements for publicly traded corporations.

The origins for BCP standards arose from the British Standards Institution (BSI) in 2006 when the BSI released a new independent standard for business continuity called BS 25999-1. Prior to the introduction of this standard for BCP, IT professionals had to rely on the previous BSI information security standard, BS 7799, which provided only limited standards for business continuity compliance procedures. One of the key benefits of these new standards was to extend additional practices for business continuity to a wider variety of organizations, to cover needs for public sector, government, non-profit, and private corporations.

Disaster Recovery

Disaster Recovery (DR) is the process, policies, and procedures related to preparing for recovery or continuation of technology infrastructure critical to an organization after either a natural or human caused disaster.

Disaster Recovery Planning (DRP) is a subset of larger processes such as Business Continuity and should include planning for resumption of applications, databases, hardware, networking, and other IT infrastructure components. A Business Continuity Plan includes planning for non-IT-related aspects such as staff member activities during a major disaster as well as site facility operations, and it should reference the Disaster Recovery Plan for IT-related infrastructure recovery and business continuity procedures and guidelines.

Business Continuity and Disaster Recovery guidelines

The following recommendations will provide you with a blueprint to formulate your requirements and implementation for a robust Business Continuity and Disaster Recovery plan:

  1. 1. Identifying the scope and boundaries of your Business Continuity Plan:

    The first step enables you to define the scope of your new business continuity plan. It provides you with the idea for limitations and boundaries of the business continuity plan. It also includes important audit and risk analysis reports for corporate assets.

  2. 2. Conducting a Business Impact Analysis session:

    Business Impact Analysis (BIA) is the assessment of financial losses to institutions, which usually results as the consequence of destructive events such as the loss or unavailability of mission-critical business services.

  3. 3. Obtaining support for your business continuity plans and goals from the executive management team:

    You will need to convince senior management to approve your business continuity plan so that you can flawlessly execute your disaster recovery planning. Assign stakeholders as representatives on the project planning committee team, once approval is obtained from the corporate executive team.

  4. 4. Understanding its specific role:

    In the possible event of a major disaster, each of your departments must be prepared to take immediate action. In order to successfully recover your mission-critical database systems with minimal loss, each team must understand the BCP and DRP plans as well as follow them correctly. Furthermore, it is also important to maintain your DRP and BCP plans as well as conduct periodic training of your IT staff members on a regular basis to have successful response time for emergencies. Such "smoke tests" to train and keep your IT staff members up-to-date on the correct procedures and communications will pay major dividends in the event of an unforeseen disaster.

One useful tool for creating and managing BCP plans is available from the National Institute of Standards and Technologies (NIST). The NIST documentation can be used to generate templates that can be used as an excellent starting point for your Business Continuity and Disaster Recovery planning. We highly recommend that you download and review the following NIST publication for creating and evaluating BCP plans, Contingency Planning Guide for Information Technology Systems, which is available online at http://csrc.nist.gov/publications/nistpubs/800-34/sp800-34.pdf.

Additional NIST documents may also provide insight into how best to manage new or current BCP or DRP plans. A complete listing of NIST publications is available online at http://csrc.nist.gov/publications/PubsSPs.html.

 

Fault tolerant systems and high availability


Fault Tolerance is data center technology that enables a system to continue to function correctly in the face of a failure with one or more faults within any given key component of the system architecture or data center. If operating quality experiences major degradation, the decrease in functionality of the environment is usually in direct proportion to the severity of the failure, whereas a poorly designed system will completely fail and breakdown with a small failure. In other words, fault tolerance gives you that added layer of protection and support to avoid a total meltdown of your mission-critical data center and, in our case, Oracle servers and database systems. Fault-tolerance is often associated with highly available systems such as those found with Oracle Data Guard and Oracle RAC technologies.

Data formats may also be designed to degrade gracefully. For example, in the case of Oracle RAC environments, services provide for load balancing to minimize performance issues in the event that one or more nodes in the cluster are lost due to an unforeseen event.

Recovery from errors in fault tolerant systems provide for either rollforward or rollback operations. For instance, whenever the Oracle server detects that it has an error condition and cannot find data from a missed transaction, rollback will occur either at the instance level or application level (a transaction must be atomic in that all elements must commit or rollback). Oracle takes the system state at that time and rolls back transactional changes to be able to move forward. Whenever a rollback is required for a transaction within Oracle, Oracle reverts the system state to some earlier correct version—for example, using the database checkpoint and rollback process inherent in the Oracle database engine and moving forward from there.

Rollback recovery requires that the operations between the checkpoint (implicit checkpoints are NEVER required for transactional recovery) and the detected erroneous state can be made to be transparent. Some systems make use of both rollforward and rollback recovery for different errors or different parts of one error.

For Oracle, database recovery always rolls back failed transactions and restores the state of the rollback or undo from which it then rolls forward using the contents of the rollback or undo segments. However, when it comes to transactional-based recovery, Oracle only rolls back. Within the scope of an individual system, fault-tolerance can be achieved by anticipating exceptional conditions and building the system to cope with them, and in general, aiming for self-healing so that the system converges towards an error-free state. In any case, if the consequence of a system failure is catastrophic, the system must be able to use reversion to fall back to a safe mode. This is similar to rollback recovery but can be a human action if humans are present in the loop.

Requirements for implementing fault tolerance

The basic characteristics of fault tolerance require:

  • No single point of failure

  • No single point of repair

  • Fault isolation to the failing component

  • Fault containment to prevent propagation of the failure

  • Availability of reversion modes

In addition, fault tolerant systems are characterized in terms of both planned and unplanned service outages. These are usually measured at the application level and not just at a hardware level. The figure of merit is called availability and is expressed as a percentage. For instance, a five nine system would therefore statistically provide 99.999% availability. Fault-tolerant systems are typically based on the concept of redundancy. In theory, this would be ideal; however, in reality this is an elusive impractical goal. Due to the time required to fail over, reestablish middle tier connections, and perform application restarts, it is not realistic to have complete availability. We can obtain four nines as the best goal for high availability with Oracle systems. For Oracle RAC, you can deploy a fault tolerant environment by using multiple network interface cards, dual Host Bus Adapters (HBAs), and multiple switches to avoid any Single Point of Failure.

Fault tolerance and replication

By using spare components, we address the first fundamental characteristic of fault-tolerance in two ways as shown next:

  • Replication: This provides multiple identical instances of the same system or subsystem by directing tasks or requests to all of them simultaneoulsy. Oracle Streams and Oracle GoldenGate, as well as third-party solutions such as Quest Shareplex, are replication technologies.

  • Redundancy: This provides you with multiple identical instances of the same system and switching to one of the remaining instances in case of a failure. This switchover and failover process is available with standby database technology with Oracle Data Guard. Oracle RAC also provides node/server failover capability with the use of services by using Fast Connection Failover (FCF) and with Fast Application Notification (FAN).

At the storage layer, the major implementations of RAID (Redundant Array of Independent Disks) with the exception of disk striping (RAID 0) provide you with fault-tolerant appliances that also use data redundancy.

Bringing the replications into synchrony requires making their internal stored states the same. They can be started from a fixed initial state such as the reset state. Alternatively, the internal state of one replica can be copied to another replica.

One variant of Data Mirror Replication (DMR) is pair-and-spare. Two replicated elements operate in lockstep as a pair, with a voting circuit that detects any mismatch between their operations and outputs a signal indicating that there is an error. Another pair operates exactly the same way. A final circuit selects the output of the pair that does not proclaim that it is in error. Pair-and-spare requires four replicas rather than the three of DMR, but has been used commercially.

If a system experiences a failure, it must continue to operate without interruption during the repair process.

When a failure occurs, the system must be able to isolate the failure to the offending component. This requires the addition of dedicated failure detection mechanisms that exist only for the purpose of fault isolation.

Recovery from a fault condition requires classifying the fault or failing component. The National Institute of Standards and Technology (NIST) categorizes faults based on locality, cause, duration, and effect.

 

High Availability solutions for Oracle


Oracle introduced the concept of the Maximum Availability Architecture (MAA) as the foundation of the high availability architecture for mission-critical applications and databases that run in large corporate data centers. Maximum Availability refers to a comprehensive end-to-end solution developed for large, mission-critical data centers that require all layers of the application, data, and system environment to be fully redundant — for example, fault tolerant, with zero data loss, and maximum uptime to protect against loss in system performance and availability. Moreover, it provides application server protection with the Oracle Application Server topology which includes middleware services, database tier with Oracle Data Guard, and system availability with Oracle RAC. The following diagram illustrates a typical architecture that implements the Maximum Availability Architecture (MAA) from Oracle. For large data center environments, we recommend that you implement a design for high availability based on this recommendation from Oracle.

There are four High Availability solutions for Oracle:

  • Oracle Data Guard

  • Oracle Streams

  • Oracle Application Server Clustering

  • High Availability—Oracle 11g R1 and 11g R2 Real Application Clusters (RAC)

Oracle Data Guard

Oracle provides a true disaster recovery solution with Oracle Data Guard. Data Guard provides a standby database environment that can be used for failover or switchover operations in the event of a database failure that may occur at the primary database site. Data Guard is a complex technology that is best explained with an architectural diagram. The following diagram illustrates an example of how a typical disaster recovery plan would set up a primary and standby data center site to deploy Oracle Data Guard technology for high availability purposes.

A complete discussion of Data Guard is beyond the scope of this book. Since Data Guard requires special care and feeding with Oracle RAC environments, we will present a later chapter on how to integrate and manage a Data Guard physical standby solution with RAC environments.

Oracle Streams

Another option for implementing the Maximum Availability Architecture (MAA) blueprint for high availability is to use Oracle Streams or Oracle GoldenGate with the Oracle RAC environments.

Oracle Streams and Oracle GoldenGate are replication technologies that allow you to replicate a copy of your database or subset of database tables to another site. Oracle Streams is not a true disaster recovery solution or high availability option, but more of a complementary solution to enhance the availability options provided by Oracle Data Guard and Oracle RAC technologies. One of the most common ways to use this technology is with large Oracle data warehouses and data marts to replicate a subset of the source data to another environment for testing and verification purposes. A better solution would be to complement the replication technologies with transportable tablespaces to enhance performance, as TTS has robust performance advantages over replication technologies. Oracle Streams uses Advanced Queuing (AQ) as the foundation of its model for propagating changes between master and target replication sites.

In addition to Data Guard and Streams, as part of the Oracle Maximum Availability Architecture (MAA) solutions, we also have the failover and clustering with Oracle Application Server Fusion Middleware servers.

Oracle Application Server Clustering

Oracle Application Servers form the core web and application layer foundation for many large data center environments. In this day and age of e-commerce and intranet site operations, Oracle Application Servers are the key components in a data center environment. Furthermore, many large firms use Oracle EBS or Oracle Application environments such as Oracle 11i or Oracle 12i Financials to manage the business operations for large financial transactions and reporting. As such, Oracle Application Servers are the middle tier or application broker component of the Oracle Applications environments.

In order to implement true Disaster Recovery (DR) for high availability and protection against costly downtime and application data loss, Oracle provides clustering and failover technology as part of the Oracle Application Server environments. The following diagram illustrates a basic Oracle Application Server environment with hardware clustering and Oracle Application clustering. It uses the virtual hosts to provide for either failover or Cold Failover Clustering (CFC) options:

In our coverage of the Oracle Maximum Availability Architecture (MAA), we introduced Data Guard, Streams, and Application Server clustering and failover. Now we will present how Oracle RAC fits into the grand scheme of this high available paradigm.

High Availability: Oracle 11g R1 Real Application Clusters (RAC)

Oracle 11g R1 RAC provides a combination of options that could be considered to be a high availability solution. It provides server level redundancy as well as database instance availability by clustering hardware and database resources. However, RAC is not a true disaster recovery solution because it does not protect against site failure or database failure.

The reason is that with an Oracle RAC configuration, the database is shared by nodes in the cluster and staged on shared storage which is a Single Point of Failure (SPOF). If the RAC database is lost, the entire cluster will fail. Many people incorrectly assume that RAC is a true Disaster Recovery (DR) solution when in fact, it is not. For a true disaster recovery solution with Oracle, you would need to implement Data Guard to protect against site and data failure events.

High Availability: Oracle 11g R2 Real Application Clusters (RAC)

Among the numerous enhancements to the Oracle 11g RAC technology, the following new features of Oracle 11g R2 RAC improve on high availability for Oracle database technology:

  • Oracle Automatic Storage Management Cluster File System (Oracle ACFS): A new scalable filesystem that extends Oracle ASM configurations and provides robust performance and availability functionality for Oracle ASM files.

  • Snapshot copy for Oracle ACFS: Provides point in time copy of the Oracle ACFS filesystem to protect against data loss.

  • Oracle ASM Dynamic Volume Manager (Oracle ADVM): Provides volume management services and disk driver interface to clients.

  • Oracle ASM Cluster Filesystem Snapshots: Provides point-in-time copy of up to 63 snapshot images with Oracle single instance and RAC environments with 11gR2.

 

Summary


In this chapter we discussed the concepts of High Availability Disaster Recovery and auxiliary topics. We also discussed a framework to design a Business Continuity Plan (BCP) that can be used to map business processes to IT infrastructure needs for mission-critical Oracle application environments. Among the core topics, we have covered:

  • High Availability concepts

  • How Oracle 11g RAC provides High Availability

  • High Availability solutions for Oracle 11g R1 and 11g R2 Real Application Clusters (RAC)

After explaining High Availability, we discussed how each of the various Oracle technologies provide the Maximum Availability Architecture (MAA) for the large data center environment as well as how to leverage these to best achieve maximum Return on Investment (ROI) within the Oracle data center. Finally, we explained why Oracle RAC is a high available solution as well as why it is not a disaster recovery solution.

In the next chapter, we will provide you with a detailed blueprint of how to design a solid Oracle RAC infrastructure for your data center environment and how to select and implement hardware, storage, and software for a robust Oracle 11g RAC configuration in the best possible manner.

Oracle 11g R1 / R2 Real Application Clusters Handbook
Unlock this book and the full library FREE for 7 days
Start now