< meta http-equiv="Content-Type" content="text/html; charset=ISO-8859-1">

Skip Headers

Oracle® High Availability Architecture and Best Practices
10g< /i> Release 1 (10.1)

Part Number B10726-01
Go to Documentation Home
Home
Go to Book List
Book List
Go to Table of Content
s
Contents
Go to Index
Index
Go to Master Index
Master  Index
Go to Feedback page
Feedback

Go to previous page
Previous
Go to next page
Nex t
View PDF

11
Restoring Fault Tolerance
< /h1>

This chapter describes how to restore redundancy to your environment aft er a failure. It includes the following topics:

Restoring Full Tolerance

Whenever a component within an HA architecture fails, then the full protection, or fault tolerance, of the architecture is compromised and possible single points of failure exist until the component is repaired. Restoring the HA architecture to full fault tolerance to reestablish full RAC, Data G uard, or MAA protection requires repairing the failed component. While full fault tolerance may be sacrificed during a scheduled outa ge, the method of repair is well understood because it is planned, the risk is controlled, and it ideally occurs at times best suited for continued application availability. However, for unscheduled outages, the risk of exposure to a single point of failure must be clearly understood.

This chapter describes the steps needed to restore database fault toler ance. It includes the following topics:

For RAC environments:

For Data Guard and MAA environments:

Restoring Failed Nodes or Instances in a RAC Cluster

Ensuring that application services fail over quickly and automatica lly within a RAC cluster, or between primary and secondary sites, is important when planning for both scheduled and unscheduled outag es. It is also important to understand the steps and processes for restoring failed instances or nodes within a RAC cluster or databa ses between sites, to ensure that the environment is restored to full fault tolerance after any errors or issues are corrected.

< a name="1005969">

Adding a failed node back into the cluster or restarting a failed RAC instance is easily done af ter the core problem that caused the specific component to originally fail has been corrected. However, the following are additional considerations.

How an application runs w ithin a RAC environment (similar to initial failover) also dictates how to restore the node or instance, as well as whether to perfor m other processes or steps.

After the problem that caused the initial node or instance fail ure has been corrected, a node or instance can be restarted and added back into the RAC environment at any time. However, there may b e some performance impact on the current workload when rejoining the node or instance. Table 11-1< /a> summarizes the performance impact of restarting or rejoining a node or instance.

Table 11-1 Performance Impact of Resta rting or Rejoining a Node or Instance  
Action Impact on Runtime Application

Restarting a node or rejoining a node into a cluster

There may be some potential performance impact while the reconfiguration occurs to add this node back into the cluster. There may or may not be an impact on the performance of the running application, but this should be evaluated.

Restarting or rejoi ning a RAC instance

When you restart a RAC instance, there may be s ome potential performance impact while lock reconfiguration takes place. Evaluation tests show the impact on current applications to be minimal, but they should be evaluated with an appropriate test workload.

Therefore, it is important to consider the following when restoring a node or RAC instance:

< /a>

The rest of this section includes the following topics:

< /a>

Recovering Service Availability< /font>

After a failed node has been brought back into the cluster and its i nstance has been started, RAC's Cluster Ready Services (CRS) automatically manages the virtual IP address used for the node and the s ervices supported by that instance automatically. A particular service may or may not be started for the restored instance. The decis ion by CRS to start a service on the restored instance depends on how the service is configured and whether the proper number of inst ances are currently providing access for the service. A service is not relocated back to a preferred instance if the service is still being provided by an available instance to which it was moved by CRS when the initial failure occurred. CRS restarts services on the restored instance if the number of instances that are providing access to a service across the cluster is less than the number of pr eferred instances defined for the service. After CRS restarts a service on a restored instance, CRS notifies registered applications of the service change.

For example, suppose the HR service is defined with instances A and B as preferred and instances C and D as available in case of a failure. If instance B fails and CRS starts up the HR service on C aut omatically, then when instance B is restarted, the HR service remains at instance C. CRS does not automatically relocate a service ba ck to a preferred instance.

Suppose a different scenario in which the HR service is defined with instances A, B, C, and D as preferred and no instances defined as available, spreading the service across all nodes in the clus ter. If instance B fails, then the HR service remains available on the remaining three nodes. CRS automatically starts the HR service on instance B when it rejoins the cluster because it is running on fewer instances than configured. CRS notifies the applications th at the HR service is again available on instance B.

See Also:

Considerations for Client Connections After Restoring a RAC Instance

After a RAC instance has been restored, additional steps may be required, depending on the current resource utilization and performance of the system, the ap plication configuration, and the network load balancing that has been implemented.

Existing connections (which may have failed over or started as a new session) on the surviving RAC instances, are not automatically redistrib uted or failed back to an instance that has been restarted. Failing back or redistributing users may or may not be necessary, dependi ng on the current resource utilization and the capability of the surviving instances to adequately handle and provide acceptable resp onse times for the workload. If the surviving RAC instances do not have adequate resources to run a full workload or to provide accep table response times, then it may be necessary to move (disconnect and reconnect) some existing user connections to the restarted ins tance.

New connections are started as they are needed, on the least-used node, assuming con nection load balancing has been configured. Therefore, the new connections are automatically load-balanced over time.

An application service can be:

This is valuable for modularizing appli cation and database form and function while still maintaining a consolidated data set. For the cases where an application is partitio ned or has a combination of partitioning and non-partitioning, the response time and availability aspects for each service should be considered. If redistribution or failback of connections for a particular service is required, then you can rebalance workloads manua lly with the DBMS_SERVICE.disconnect_session PL/SQL procedure. You can use this procedure to disconnect sessions associa ted with a service while the service is running.

For load-balancing application services ac ross multiple RAC instances, Oracle Net connect-time failover and connection load balancing are recommended. This feature does not re quire changes or modifications for failover or restoration. It is also possible to use hardware-based load balancers. However, there may be limitations in distinguishing separate application services (which is understood by Oracle Net Services) and restoring an inst ance or a node. For example, when a node or instance is restored and available to start receiving new connections, a manual step may be required to include the restored node or instance in the hardware-based load balancer logic, whereas Oracle Net Services does not require manual reconfiguration.

Table 11-2 summarize s the considerations for new and existing connections after an instance has been restored. The considerations differ depending on whe ther the application services are partitioned, nonpartitioned, or have a combination of each type. The actual redistribution of exist ing connections may or may not be required depending on the resource utilization and response times.

Table 11-2 Restoration and Connection Failback  

Figure 11-1 shows a 2-node partitioned RAC databas e. Each instance services a different portion of the application (HR and Sales). Client processes connect to the appropriate instance based on the service they require.

Figure 1 1-1 Partitioned 2-Node RAC Database

Text description of maxav015.gif
follows

Text description of the illustration maxav015.gif

< p class="BP">Figure 11-2 shows what happens when one RAC instance fails.

Figure 11-2 RAC Instance Failover in a Partitioned Database< /font>

Text description of maxav016.gif follows

Text description of the illustration maxav016.gif

If one RAC instance fails, t hen the service and existing client connections can be automatically failed over to another RAC instance. In this example, the HR and Sales services are both supported by the remaining RAC instance. In addition, new client connections for the Sales service can be ro uted to the instance now supporting this service.

After the failed instance has been repair ed and restored to the state shown in Figure 11-1 and the Sales service is relocated to the re stored instance failed-over clients and any new clients that had connected to the Sales service on the failed-over instance may need to be identified and failed back. New client connections, which are started after the instance has been restored, should automaticall y connect back to the original instance. Therefore, over time, as older connections disconnect, and new sessions connect to the Sales service, the client load migrates back to the restored instance. Rebalancing the load immediately after restoration depends on the r esource utilization and application response times.

Figure&nb sp;11-3 shows a nonpartitioned application. Services are evenly distributed across both active instances. Each instance has a mix of client connections for both HR and Sales.

Figure 11-3 Nonpartitioned RAC Instances

Text description of maxav0
17.gif follows

Text description of the illustration maxav017.gif

If one RAC instance fails, then CRS moves the services that were running on the failed instance. In addition, new client connections are routed only to the available RAC instances that offer that service.

After the failed instance has been repaired and restored to the state shown in Figure 11-3, s ome clients may need to be moved back to the restored instance. For nonpartitioned applications, identifying appropriate services is not required for rebalancing the client load among all available instances. Also, this is necessary only if a single instance is not able to adequately service the requests.

New client connections that are started after the instance has been restored should automatically connect back to the restored instance because it has a smaller load. Therefore, over time, as older connections disconnect and new sessions connect to the restored instance, the client load will again evenly balance ac ross all available RAC instances. Rebalancing the load immediately after restoration depends on the resource utilization and applicat ion response times.

Restoring the Standby Database After a Failover

Following an unplanned production database outage that requires a failover, full fault tolerance is compromised until the physical or logical standby database is reestablished. Full database protection should be restored as soon as possible. Steps for restoring fault toleran ce differ slightly for physical and logical standby databases.

Standby databases do not nee d to be reinstantiated because of Oracle's Flashback Database feature. Flashback Database:

  • Saves hours of database restoration time
  • Reduces overall complexity in restoring fault tolerance
  • Reduces the time that the syst em is vulnerable because the standby database is re-created more quickly

This section includes the following topics:

Restoring a Physical Standby Database After a Failover

The following steps are required to restore a physical standby database after a failover. The steps assume that archived redo logs and sufficient flashback log data are available.

Step 1P: Retrieve STANDBY_BECAME_PRIMARY_SCN

From the new production database, execute the following query:

SELECT TO_CHAR(STANDB
Y_BECAME_PRIMARY_SCN) FROM V$DATABASE;

Use this SCN to convert th e previous production database to a standby database.

Step 2P: Flash Back the Previous Production Database

Log on to the previous production database and execute the following statements:

SHUTDOWN IMMEDIATE; /*if necessary */
STARTUP MOUNT;
FLASHBACK DATABA
SE TO SCN standby_became_primary_scn;

If there is insufficient fl ashback data, then see Oracle Data Guard Concepts and Administration about creating a new standby database.

Step 3P: Mount New Standby Database From Previous Production Database

Mounting the new standby database requires the following substeps:

< ol class="LN1" type="1">
  • Turn off flashback mode. This deletes the flashback logs, which are obsolete after the standby control file is restored.
        ALTER DATABASE FLASH
    BACK OFF;
    
    
  • Create the standby control fil e.
        ALTER DATABASE CREATE STANDBY CONTROLFILE AS controlfile_name;
        SHUTDOWN IMMEDIATE;
    
    
  • Issue operating system copy commands to replace the current control files with the new standby control files.
  • Mount the new standby database with the corresponding standby control file.
        STARTUP MOUNT;
     
    
  • Ensure that the standby listener is running.
        LSNRCTL STAT list_name;
    
  • Step 4P: A rchive to New Standby Database From New Production Database

    Before t he new standby database was created, the current production remote standby archive destination probably stopped with an error and is no longer shipping files to the remote destination. To restart remote archiving, you may have to reenable the standby archive destina tion.

    Query the V$ARCHIVE_DEST_STATUS view to see the current state of the arc hive destinations.

    SELECT DEST_ID, DEST_NAME, STATUS, PROTECTION_MODE, DESTINATION, ERROR,
    SRL
        FROM V$ARCHIVE_DEST_STATUS;
    
    

    Enable the remote archive destination.

    ALTER SYSTEM SET LOG_ARCHIVE_DEST_STATE_
    n=ENABLE;
    
    

    Switch a redo log file and verify that it was sent successfully.

    ALTER SYSTEM SWITCH LOGFILE;
    SELECT DEST_ID, DEST_NAME
    , STATUS, PROTECTION_MODE, DESTINATION, ERROR, SRL
        FROM V$ARCHIVE_DEST_STATUS;
    
    

    Shipping the archived redo log from the new production database notifies the standby database of the new production database incarnation number.

    Step 5P: Start Managed Recovery

    Start managed recovery or real-time apply managed recovery with one of the following statements:

    ALTER DATABASE RECOVER MANAGED STANDBY DATABASE DISCONNECT;
    
    

    or

    ALTER DATABASE RECOVER MANAGED STANDBY DATABASE USING CURRENT LOGFILE 
    DISCONNEC
    T;
    
    

    Ensure that recovery is applying the archived redo logs.

    < pre class="CE">SELECT * FROM V$MANAGED_STANDBY;

    Step 6P: Restart MRP After It Encounters the End-of-Redo Marker

    The managed recovery process (MRP) stops after it encounters the end-of-r edo marker that demarcates when the Data Guard failover was completed in the redo stream. This is not an error. Restart MRP, and it w ill continue with no problem.

    Restoring a Logical Standby Database After a Failover

    The following steps are required to restore a logical standby database after a failover:

    Step 1 L: Retrieve END_PRIMARY_SCN

    On the new production database, query fo r the SCN at which the previous standby database became the new production database.

    SELECT
     VALUE AS TO_CHAR(STANDBY_BECAME_PRIMARY_SCN) FROM DBA_LOGSTDBY_PARAMETERS
    WHERE NAME = 'END_PRIMARY_SCN';
    
    
    
    

    Step 2L: F lash Back the Previous Production Database

    You can create a new logi cal standby database by mounting the previous production database, flashing it back to STANDBY_BECAME_PRIMARY_SCN, and t hen enabling the previous guard level setting.

    SHUTDOWN IMMEDIATE;
    ST
    ARTUP MOUNT;
    FLASHBACK DATABASE TO SCN standby_became_primary_scn;
    ALTER DATABASE GUARD [
    ALL | STANDBY | NONE];
    

    Step 3L: Open New Logical Standby Database and Start SQL Apply

    ALTER DATABASE OPEN RESETLOGS;
    ALTER DATABASE START LOGICAL STANDBY APPLY NEW PRIMARY dblink;
    
    

    You need to create a database link from the ne w logical standby database to the new production database if it does not already exist. Use the following syntax:

    CREATE PUBLIC DATABASE LINK dblink
      CONNECT TO system IDENTIFIED BY password
      USING 'service_name_of_new_p
    rimary_database';
    

    Restoring Fault Tolerance after Secondary Site or Clusterwide Scheduled Outage

    The following steps are required to restore full fault tolerance after a scheduled secondary site or clus terwide outage:

    Step 1: Start the Standby Database

    You may have to restore the standby database from local backups, local tape backups, or from the primary site backups if t he data in the secondary site has been damaged. Re-create the standby database from the new production database by following the step s for creating a standby database in Oracle Data Guard Concepts and Administration.

    After the standby database has been reestablish ed, start the standby database.

    Application Services Failback or Restore Existing Connections Failback or Restore New Connections

    Partitioned

    Existing sessions a re not automatically relocated back to the restored instance. Use DBMS_SERVICE.disconnect_session to manually disconnect sessions and allow them to be reestablished on one of the remaining instances that provides the service.

    Automatically routes to the restored instance by using the Oracle Net Services configuration.< /p>

    Nonpartitioned

    No action is necessary unless the load needs to be rebalanced, becau se restoring the instance means that the load there is low. If the load needs to be rebalanced, then the same problems are encountere d as if application services were partitioned.

    Automatically routes to the restored instance (because its load should be lowest) by using the Oracle Net Services configuration

    Step 2: St art Recovery

    Type of Standby Database SQL Statement

    Physical

    ST ARTUP MOUNT;

    Logical

    STARTUP;

    Type of Standby Database SQL Statement

    Physical

    RECOVER MANAGED STANDBY DATABASE DISCONNECT;

    Logical

    ALTER DAT ABASE START LOGICAL STANDBY APPLY;

    Step 3: Verify Log Transport Services on Production Database

    < !--/TOC=h2-->

    You may have to reenable the production database remote archive destination. Quer y the V$ARCHIVE_DEST_STATUS view first to see the current state of the archive destinations:

    SELECT DEST_ID, DEST_NAME, STATUS, PROTECTION_MODE, DESTINATION, ERROR, SRL 
    FROM V$ARCHIVE_DEST
    _STATUS;
    ALTER SYSTEM SET LOG_ARCHIVE_DEST_STATE_n=ENABLE;
    ALTER
    SYSTEM SWITCH LOGFILE;
    
    

    Verify log transport services between the production and standby databases by checking for errors. Query V$ARCHIVE_DEST and V$ARCHIVE_DEST_STATUS vie ws.

    SELECT STATUS, TARGET, LOG_SEQUENCE, TYPE, PROCESS, REGISTER, ERROR 
    
        FROM V$ARCHIVE_DEST;
    SELECT * FROM V$ARCHIVE_DEST_STATUS WHERE STATUS!='INACTIVE';
    

    Step 4: Verify that Recove ry is Progressing on Standby Database

    For a physical standby databas e, verify that there are no errors from the managed recovery process and that the recovery has applied the archived redo logs.

    SELECT MAX(SEQUENCE#), THREAD# FROM V$LOG_HISTORY GROUP BY THREAD; SELECT P ROCESS, STATUS, THREAD#, SEQUENCE#, CLIENT_PROCESS FROM V$MANAGED_STANDBY; < a name="1008209">

    For a logical standby database, verify that there are no errors from the logical standby process and that the recovery has applied the archived redo logs.

    SELECT THREAD#, SEQUENCE# SEQ# 
    <
    a name="1008289">    FROM DBA_LOGSTDBY_LOG LOG, DBA_LOGSTDBY_PROGRESS PROG 
        WHERE PROG.APPLIED_SCN BETW
    EEN LOG.FIRST_CHANGE# AND LOG.NEXT_CHANGE# 
        ORDER BY NEXT_CHANGE#;
    

    Step 5: Restore Production Database Protection Mode

    If you had to change the protection mode of the production dat abase from maximum protection to either maximum availability or maximum performance because of the standby database outage, then chan ge the production database protection mode back to maximum protection depending on your business requirements.

    ALTER DATABASE SET STANDBY DATABASE TO MAXIMIZE [PROTECTION | AVAILABILITY];
    
    < /a>See Also:

    "Changing the Data Protection Mode"

    Restoring Fault T olerance after a Standby Database Data Failure

    Following an unplanne d outage of the standby database that requires a full or partial datafile restoration (such as data or media failure), full fault tol erance is compromised until the standby database is brought back into service. Full database protection should be restored as soon as possible. Note that using a Hardware Assisted Resilient Database configuration can prevent this type of problem.

    See Also:

    Appendix A, "Hardware Assisted Resilient Data (HARD) Initiative"

    The following steps are required to restore full fault tolerance after data failure of t he standby database:

    Step 1: Fix the Cause of the Outage

    The root cause of the outage should be investigated and action taken to prevent the problem from occurring again.

    Step 2: Restore the Backup of Affected Datafiles

    Only the affected datafiles need to be restored on to the standby site.

    Step 3: Restore Required Archived Redo Log Files

    Archived redo log files may need to be restored to recover the restored data files up to the configured lag.

    For physical standby databases:

    • If the archived redo logs required for recovery are available on the standby system in a confi gured archive destination, then the managed recovery process automatically finds and applies them as needed. No restoration is necess ary.
    • If the required archived redo logs have been deleted from the standby sys tem but are still available on the production system, then the fetch archive log (FAL) process is automatically invoked to transfer t hem to the standby system. No restoration is necessary.

    For logical standby databases , initiate complete media recovery for the affected files. Consider the following:

    • < a name="1008228">If the archived redo logs required for recovery are available on the standby system in a configured archive dest ination, then the recovery process automatically finds and applies them as needed. No restoration is necessary.
    • If the required archived redo logs have been deleted from the standby system, then they must be res tored to the standby system. Complete media recovery for the affected files after the necessary archived redo logs are available.

    Step 4 : Start the Standby Database

    After the standby database has been ree stablished, start the standby database.

    Type of Standby Databa se SQL Statement

    Physical

    STARTUP MOUNT;

    Logical

    STARTUP;

    St ep 5: Start Recovery or Apply

    Type of S tandby Database SQL Statement

    Physical

    < p class="TB">RECOVER MANAGED STANDBY DATABASE DISCONNECT;

    Logical

    ALTER DATABASE START LOGICAL STANDBY APPLY;

    Verify log transport services on the new production database by checking for errors when querying V$ARCHIVE_DEST and V$ARCHIVE_DEST_STATUS.

    SELECT STATUS, TARGET, LOG_SEQUENCE, TYPE, PROCESS, REGISTER, ERROR FROM 
    V$ARCHIVE_DEST;
    
    SELECT * FROM V$ARCHIVE_DEST_STATUS WHERE STATUS != 'INACTIVE';
    

    For a physical standby database, verify that ther e are no errors from the managed recovery process and that the recovery has applied archived redo logs.

    SELECT MAX(SEQUENCE#), THREAD# FROM V$LOG_HISTORY GROUP BY THREAD;
    SELECT PROCESS, STATUS, THREAD#
    , SEQUENCE#, CLIENT_PROCESS
        FROM V$MANAGED_STANDBY;
    
    
    < p class="BP">For a logical standby database, verify that there are no errors from the logical standby process and that the recovery h as applied archived redo logs.

    SELECT THREAD#, SEQUENCE# SEQ# 
        FR
    OM DBA_LOGSTDBY_LOG LOG, DBALOGSTDBY_PROGRESS PROG
        WHERE PROG.APPLIED_SCN BETWEEN LOG.FIRST_CHANGE# AND LOG
    .NEXT_CHANGE#
        ORDER BY NEXT_CHANGE#;
    

    < font face="Arial, Helvetica, sans-serif" color="#330099">Step 8: Restore Production Database Protection Mode

    Restoring Fault Tolerance After the Production Database Has Opened Reset logs

    If the production database is activated because it was flashed back to correct a logical error or because it was restored and recovered to a point in time, then the corresponding standby database may require additional maintenance. No additional work is required if the production database did complete recovery with no resetlogs .

    After activating the production database, execute the queries in the following table.

    Database Query

    Production database

    SELECT TO_CHAR(RESETLOGS_CHANGE# - 2) FROM V$DATAB ASE;

    Physical standby database

    SELECT TO_CHAR(CURRENT_SCN) FROM V$DATABASE;

    Logical standby database

    SELECT APPLIED_SCN FROM DBA _LOGSTDBY_PROGRESS;

    Scenario 1: SCN on Standby is Behind Resetlogs SCN on Production

    Scenario 2: SCN on Standby is Ahead of Resetlogs SCN on Pro duction

    < th class="Informal" align="left" valign="bottom" scope="col">Action
    Database

    Physical standby database

    1. Ensure that the standby database has received a new archived redo log file from the production database.
    2. < li class="TLN" type="1" value="2">Flash back the database to the SCN that is 2 SCNs before the resetlogs occurr ed.
      SHUTDOWN IMMEDIATE; /* if necessary */
      STARTUP MOUNT;
      FLASHBACK DATABASE TO SCN resetlogs_change#_minus_
      2;
      
    3. Restart recovery. See "Step 5P: Start Managed Recovery".

    Logical standby database

    1. Retrieve production database flashback time or SCN. The flashback time or SCN needs to be extracted from the production databa se alert log.
    2. Stop SQL Apply on the logical standby database.
      ALTER DATABASE STOP LOGICAL STANDBY APPLY;
      SELECT APPLIED_SCN FROM DBA_LOGSTDBY_
      PROGRESS;
      
    3. Flash back the logical standby database.

      Issue the following SQL statements to flash back the logical standby database to the same time used to flash back the primary database.

      SHUTDOWN;
      STARTUP MOUNT EXCLUSI
      VE;
      FLASHBACK DATABASE TO TIMESTAMP time_of_primary_
      database_flashback;
      ALTER DATABASE OPEN READ ONLY;
      SELECT APPLIED_SCN FROM DBA_LOGSTDBY_PROGRESS;
      
    4. Open the logical standby database with resetlogs
      SHUTDOWN;
      STARTUP MOUNT EXCLUSIVE;
      ALTER DATABASE OPEN RESETLOGS;
      
    5. Archive the current log on the primary database.
      
      ALTER SYSTEM ARCHIVE LOG CURRENT;
      
    6. Start SQL Apply.
      ALTER DATABASE START LOGICAL STANDBY APPLY;
      

    Restoring Fault Tolerance a fter Dual Failures

    If a dual failure affecting both the standby and production databases occurs, then you need to re-create the production database first. Because the sites are identical, the productio n database can be created wherever the most recent backup resides.

    Table 11-3 summarizes the recovery strategy depending on the type of backups that are available.

    Table 11-3 Re-Creating the Production and Standby Databases  
    < td class="Formal">

    Restore backup from the production database. Recover and activate the databas e as the new production database.

    Available Bac kups Re-Creating the Production Database

    Local backup on production and standby databases

    Local backup only on standby database. Tape backups on standby database.

    Restore the local standby backup to the standby database. Recover and activate the database as the new produc tion database.

    T ape backups only

    Restore tape backups locally. Recover the database and activate it as the new production database.

    After the produc tion database is re-created, follow the steps for creating a new standby database that are described in Oracle Data Guard Concepts and Administration.