Oracle® High Availability Architecture and Best Practices 10g Release 1 (10.1) Part Number B10726-01
This chapter describes how to restore redundancy to your environment after a failure.
When a component within an HA architecture fails, the full protection, or fault tolerance, of the architecture is compromised, and possible single points of failure exist until the component is repaired. Restoring the HA architecture to full fault tolerance, that is, reestablishing full RAC, Data Guard, or MAA protection, requires repairing the failed component. While full fault tolerance may be sacrificed during a scheduled outage, the method of repair is well understood because it is planned, the risk is controlled, and it ideally occurs at times best suited for continued application availability. However, for unscheduled outages, the risk of exposure to a single point of failure must be clearly understood.
This chapter describes the steps needed to restore database fault tolerance. It includes the following topics:
For RAC environments:
For Data Guard and MAA environments:
Ensuring that application services fail over quickly and automatically within a RAC cluster, or between primary and secondary sites, is important when planning for both scheduled and unscheduled outages. It is also important to understand the steps and processes for restoring failed instances or nodes within a RAC cluster or databases between sites, to ensure that the environment is restored to full fault tolerance after any errors or issues are corrected.
Adding a failed node back into the cluster or restarting a failed RAC instance is easily done after the core problem that caused the specific component to originally fail has been corrected. However, the following are additional considerations.
How an application runs within a RAC environment (similar to initial failover) also dictates how to restore the node or instance, as well as whether to perform other processes or steps.
After the problem that caused the initial node or instance failure has been corrected, a node or instance can be restarted and added back into the RAC environment at any time. However, there may be some performance impact on the current workload when rejoining the node or instance. Table 11-1 summarizes the performance impact of restarting or rejoining a node or instance.
Therefore, it is important to consider the following when restoring a node or RAC instance:
The rest of this section includes the following topics:
After a failed node has been brought back into the cluster and its instance has been started, RAC's Cluster Ready Services (CRS) automatically manages the virtual IP address used for the node and the services supported by that instance. A particular service may or may not be started for the restored instance. The decision by CRS to start a service on the restored instance depends on how the service is configured and whether the proper number of instances are currently providing access for the service. A service is not relocated back to a preferred instance if the service is still being provided by an available instance to which it was moved by CRS when the initial failure occurred. CRS restarts services on the restored instance if the number of instances that are providing access to a service across the cluster is less than the number of preferred instances defined for the service. After CRS restarts a service on a restored instance, CRS notifies registered applications of the service change.
For example, suppose the HR service is defined with instances A and B as preferred and instances C and D as available in case of a failure. If instance B fails and CRS starts up the HR service on C automatically, then when instance B is restarted, the HR service remains at instance C. CRS does not automatically relocate a service back to a preferred instance.
Suppose a different scenario in which the HR service is defined with instances A, B, C, and D as preferred and no instances defined as available, spreading the service across all nodes in the cluster. If instance B fails, then the HR service remains available on the remaining three nodes. CRS automatically starts the HR service on instance B when it rejoins the cluster because it is running on fewer instances than configured. CRS notifies the applications that the HR service is again available on instance B.
After a RAC instance has been restored, additional steps may be required, depending on the current resource utilization and performance of the system, the application configuration, and the network load balancing that has been implemented.
Existing connections on the surviving RAC instances (which may have failed over or started as a new session) are not automatically redistributed or failed back to an instance that has been restarted. Failing back or redistributing users may or may not be necessary, depending on the current resource utilization and the capability of the surviving instances to adequately handle the workload and provide acceptable response times. If the surviving RAC instances do not have adequate resources to run a full workload or to provide acceptable response times, then it may be necessary to move (disconnect and reconnect) some existing user connections to the restarted instance.
New connections are started as they are needed, on the least-used node, assuming connection load balancing has been configured. Therefore, the new connections are automatically load-balanced over time.
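The restart rule in the two HR scenarios above can be modeled in a few lines. The following is a minimal sketch in Python, not CRS code: the function name `crs_restarts_service` and the use of sets of instance names are assumptions made for illustration, and the model covers only the single rule stated in the text (restart when fewer instances are running the service than are defined as preferred).

```python
def crs_restarts_service(preferred, running):
    """Model of the rule described in the text (illustrative only).

    preferred: set of instance names defined as preferred for the service.
    running:   set of instance names currently providing the service.
    Returns True if the service would be started on a restored instance,
    that is, if fewer instances run the service than are preferred.
    """
    return len(running) < len(preferred)


# Scenario 1: A and B preferred; B failed and the service moved to C.
# Two instances (A, C) already provide HR, so B does not get it back.
scenario_1 = crs_restarts_service(preferred={"A", "B"}, running={"A", "C"})

# Scenario 2: A, B, C, D all preferred; B failed, three instances remain.
# HR runs on fewer instances than configured, so it is started on B.
scenario_2 = crs_restarts_service(
    preferred={"A", "B", "C", "D"}, running={"A", "C", "D"}
)

print(scenario_1, scenario_2)
```

The sketch makes the asymmetry visible: a service that failed over to an available instance stays there, while a service with no spare instances returns as soon as the preferred instance rejoins.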
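To see why new connections rebalance over time while existing ones do not, consider a simplified model. The sketch below is in Python and assumes that "least-used" means "fewest current sessions". Real connection load balancing relies on load information known to the listeners, so the function `place_new_connections` and the session counts are hypothetical.

```python
def place_new_connections(session_counts, new_connections):
    """Send each new connection to the instance with the fewest sessions.

    session_counts:  dict mapping instance name to current session count.
    new_connections: number of new connections to place.
    Existing sessions are never moved; ties are broken by instance name.
    Returns a new dict with the updated counts.
    """
    counts = dict(session_counts)
    for _ in range(new_connections):
        target = min(sorted(counts), key=counts.get)
        counts[target] += 1
    return counts


# Instance B has just been restored and has no sessions yet.
before = {"A": 100, "B": 0, "C": 100}
after = place_new_connections(before, 60)
print(after)
```

In this model, all 60 new connections land on the restored instance, and the 200 sessions already on A and C stay where they are. If the surviving instances are overloaded in the meantime, existing sessions still have to be moved manually.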
An application service can be:
This is valuable for modularizing application and database form and function while still maintaining a consolidated data set. For the cases where an application is partitioned or has a combination of partitioning and nonpartitioning, the response time and availability aspects for each service should be considered. If redistribution or failback of connections for a particular service is required, then you can rebalance workloads manually with the DBMS_SERVICE.disconnect_session PL/SQL procedure. You can use this procedure to disconnect sessions associated with a service while the service is running.
For load-balancing application services across multiple RAC instances, Oracle Net connect-time failover and connection load balancing are recommended. These features do not require changes or modifications for failover or restoration. It is also possible to use hardware-based load balancers. However, there may be limitations in distinguishing separate application services (which are understood by Oracle Net Services) and in restoring an instance or a node. For example, when a node or instance is restored and available to start receiving new connections, a manual step may be required to include the restored node or instance in the hardware-based load balancer logic, whereas Oracle Net Services does not require manual reconfiguration.
Table 11-2 summarizes the considerations for new and existing connections after an instance has been restored. The considerations differ depending on whether the application services are partitioned, nonpartitioned, or have a combination of each type. The actual redistribution of existing connections may or may not be required depending on the resource utilization and response times.
| Application Services | Failback or Restore Existing Connections | Failback or Restore New Connections |
|---|---|---|
| Partitioned | Existing sessions are not automatically relocated back to the restored instance. Use DBMS_SERVICE.disconnect_session to rebalance them manually. | Automatically routes to the restored instance by using the Oracle Net Services configuration. |
| Nonpartitioned | No action is necessary unless the load needs to be rebalanced, because restoring the instance means that the load there is low. If the load needs to be rebalanced, then the same problems are encountered as if application services were partitioned. | Automatically routes to the restored instance (because its load should be lowest) by using the Oracle Net Services configuration. |

| Type of Standby Database | SQL Statement |
|---|---|
| Physical | |
| Logical | |

| Type of Standby Database | SQL Statement |
|---|---|
| Physical | |
| Logical | |
You may have to reenable the production database remote archive destination. Query the V$ARCHIVE_DEST_STATUS view first to see the current state of the archive destinations:
SELECT DEST_ID, DEST_NAME, STATUS, PROTECTION_MODE, DESTINATION, ERROR, SRL FROM V$ARCHIVE_DEST_STATUS;
ALTER SYSTEM SET LOG_ARCHIVE_DEST_STATE_n=ENABLE;
ALTER SYSTEM SWITCH LOGFILE;
Verify log transport services between the production and standby databases by checking for errors. Query the V$ARCHIVE_DEST and V$ARCHIVE_DEST_STATUS views.
SELECT STATUS, TARGET, LOG_SEQUENCE, TYPE, PROCESS, REGISTER, ERROR FROM V$ARCHIVE_DEST;
SELECT * FROM V$ARCHIVE_DEST_STATUS WHERE STATUS != 'INACTIVE';
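Checking the query output for errors can be scripted once the rows have been fetched by whatever client you use. The following Python sketch assumes each row is a dictionary keyed by column name and treats a destination as healthy only when its STATUS is VALID and its ERROR column is empty. The function name `destinations_with_problems` and the sample rows are hypothetical.

```python
def destinations_with_problems(rows):
    """Return the archive destination rows that need attention.

    rows: iterable of dicts with at least "STATUS" and "ERROR" keys,
    for example rows fetched from V$ARCHIVE_DEST_STATUS.
    INACTIVE destinations are skipped, mirroring the WHERE clause above.
    A row is flagged when ERROR is non-empty or STATUS is not VALID
    (an assumption of this sketch, not a rule from the manual).
    """
    problems = []
    for row in rows:
        status = (row.get("STATUS") or "").upper()
        error = (row.get("ERROR") or "").strip()
        if status == "INACTIVE":
            continue
        if error or status != "VALID":
            problems.append(row)
    return problems


# Hypothetical sample rows; the error text is a placeholder.
sample_rows = [
    {"DEST_ID": 1, "STATUS": "VALID", "ERROR": None},
    {"DEST_ID": 2, "STATUS": "ERROR", "ERROR": "destination unreachable"},
    {"DEST_ID": 3, "STATUS": "INACTIVE", "ERROR": None},
    {"DEST_ID": 4, "STATUS": "DEFERRED", "ERROR": ""},
]
flagged_ids = [row["DEST_ID"] for row in destinations_with_problems(sample_rows)]
print(flagged_ids)
```

A DEFERRED destination is flagged on purpose: it is the case in which the destination may need to be reenabled before log transport resumes.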
For a physical standby database, verify that there are no errors from the managed recovery process and that the recovery has applied the archived redo logs.
SELECT MAX(SEQUENCE#), THREAD# FROM V$LOG_HISTORY GROUP BY THREAD#;
SELECT PROCESS, STATUS, THREAD#, SEQUENCE#, CLIENT_PROCESS FROM V$MANAGED_STANDBY;
For a logical standby database, verify that there are no errors from the logical standby process and that the recovery has applied the archived redo logs.
SELECT THREAD#, SEQUENCE# SEQ# FROM DBA_LOGSTDBY_LOG LOG, DBA_LOGSTDBY_PROGRESS PROG WHERE PROG.APPLIED_SCN BETWEEN LOG.FIRST_CHANGE# AND LOG.NEXT_CHANGE# ORDER BY NEXT_CHANGE#;
If you had to change the protection mode of the production database from maximum protection to either maximum availability or maximum performance because of the standby database outage, then change the production database protection mode back to maximum protection depending on your business requirements.
ALTER DATABASE SET STANDBY DATABASE TO MAXIMIZE [PROTECTION | AVAILABILITY];
Following an unplanned outage of the standby database that requires a full or partial datafile restoration (such as data or media failure), full fault tolerance is compromised until the standby database is brought back into service. Full database protection should be restored as soon as possible. Note that using a Hardware Assisted Resilient Database configuration can prevent this type of problem.
The following steps are required to restore full fault tolerance after data failure of the standby database:
The root cause of the outage should be investigated and action taken to prevent the problem from occurring again.
Only the affected datafiles need to be restored onto the standby site.
Archived redo log files may need to be restored to recover the restored data files up to the configured lag.
For physical standby databases:
For logical standby databases, initiate complete media recovery for the affected files. Consider the following:
After the standby database has been reestablished, start the standby database.
| Type of Standby Database | SQL Statement |
|---|---|
| Physical | |
| Logical | |

| Type of Standby Database | SQL Statement |
|---|---|
| Physical | RECOVER MANAGED STANDBY DATABASE DISCONNECT; |
| Logical | |
Verify log transport services on the new production database by checking for errors when querying V$ARCHIVE_DEST and V$ARCHIVE_DEST_STATUS.
SELECT STATUS, TARGET, LOG_SEQUENCE, TYPE, PROCESS, REGISTER, ERROR FROM V$ARCHIVE_DEST;
SELECT * FROM V$ARCHIVE_DEST_STATUS WHERE STATUS != 'INACTIVE';
For a physical standby database, verify that there are no errors from the managed recovery process and that the recovery has applied archived redo logs.
SELECT MAX(SEQUENCE#), THREAD# FROM V$LOG_HISTORY GROUP BY THREAD#;
SELECT PROCESS, STATUS, THREAD#, SEQUENCE#, CLIENT_PROCESS FROM V$MANAGED_STANDBY;
For a logical standby database, verify that there are no errors from the logical standby process and that the recovery has applied archived redo logs.
SELECT THREAD#, SEQUENCE# SEQ# FROM DBA_LOGSTDBY_LOG LOG, DBA_LOGSTDBY_PROGRESS PROG WHERE PROG.APPLIED_SCN BETWEEN LOG.FIRST_CHANGE# AND LOG.NEXT_CHANGE# ORDER BY NEXT_CHANGE#;
If the production database is activated because it was flashed back to correct a logical error or because it was restored and recovered to a point in time, then the corresponding standby database may require additional maintenance. No additional work is required if the production database did complete recovery with no resetlogs.
After activating the production database, execute the queries in the following table.
| Database | Action |
|---|---|
| Physical standby database | SHUTDOWN IMMEDIATE; /* if necessary */ STARTUP MOUNT; FLASHBACK DATABASE TO SCN resetlogs_change#_minus_2; |
| Logical standby database | |
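The placeholder resetlogs_change#_minus_2 in the table stands for a concrete SCN: the production database's resetlogs change number minus 2. As a small sketch in Python, the arithmetic and the resulting statement can be produced as follows. The helper name `flashback_statement` is hypothetical, and the input value is assumed to have been read from the production database, for example from the RESETLOGS_CHANGE# column of V$DATABASE.

```python
def flashback_statement(resetlogs_change):
    """Build the FLASHBACK DATABASE statement for a physical standby.

    resetlogs_change: the production database's resetlogs change number
    (an integer SCN). The target SCN is that value minus 2, as in the
    table above.
    """
    if not isinstance(resetlogs_change, int) or resetlogs_change < 2:
        raise ValueError("resetlogs_change must be an integer SCN of at least 2")
    target_scn = resetlogs_change - 2
    return f"FLASHBACK DATABASE TO SCN {target_scn};"


# Example with a made-up SCN value.
statement = flashback_statement(1000)
print(statement)
```

Generating the statement from the queried value avoids transcription mistakes when the SCN is long. The statement itself is still run on the mounted standby database as shown in the table.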
If a dual failure affecting both the standby and production databases occurs, then you need to re-create the production database first. Because the sites are identical, the production database can be created wherever the most recent backup resides.
Table 11-3 summarizes the recovery strategy depending on the type of backups that are available.
| Available Backups | Re-Creating the Production Database |
|---|---|
| Local backup on production and standby databases | |
| Local backup only on standby database. Tape backups on standby database. | Restore the local standby backup to the standby database. Recover and activate the database as the new production database. |
| Tape backups only | Restore tape backups locally. Recover the database and activate it as the new production database. |

After the production database is re-created, follow the steps for creating a new standby database that are described in Oracle Data Guard Concepts and Administration.