Skip Headers

Oracle® High Availability Architecture and Best Practices
10g Release 1 (10.1)

Part Number B10726-01
Go to Documentation Home
Home
Go to Book List
Book List
Go to Table of Contents
Contents
Go to Index
Index
Go to Master Index
Master&n bsp;Index
Go to Feedback page
Feedback

Go to previous page
Previous
Go to next page
Next< /font>
View PDF

10
Detailed Recovery Steps

This chapter describes the detailed recovery operations that are referred to in the outages and solutions tables in Chapter 9, "Recovering from Outages". It includes the following topics:

Sum mary of Recovery Operations

This chapter describes the detailed reco very operations that are referred to in the outages and solutions tables in Chapter 9, "Recovering from Outages". Table 10-1 summarizes the recovery operations that are described in this c hapter.

Table 10-1 Recovery Operations   < /tr>
Recovery Operation Description

Complete or Partial Site Failover

In the complete site fai lover scenario, existing connections fail and new connections are routed to a secondary or failover site. This occurs when there is a true disaster and where the application stack is replicated.

In the partial site failover s cenario, the primary site is intact, and the middle-tier applications need to be redirected after the database has failed over or swi tched over to a standby database on the secondary site. This configuration is not recommended if performance decreases dramatically b ecause of the greater latency between the application servers and the database.

Database Failover

A Data Guard failover is invoked in the database tier because of an unscheduled outa ge. If complete recovery is attempted, then there is minimal or no data loss. A subsequent complete or partial site failover must occ ur. The previous production database can be converted to a standby database by flashing back the database.

Database Swi tchover

A Data Guard switchover occurs in the database tier. Th e previous production database becomes the new standby database while the previous standby database becomes the new production databa se. A subsequent complete or partial site failover must occur. This is a scheduled or planned outage.

RAC Recovery< /p>

RAC automatically handles instance and node failures at a given sit e to provide continued access to the backend database. The reconnection to available instances can be transparent from an application standpoint and occurs only on the primary site. The application service migrates to the available or designated instances.

Apply Instance Failover

When a standby node requires maint enance, you can switch to another standby instance to avoid any impact on the production database and to ensure that the standby reco very does not fall behind. When the standby cluster requires maintenance or the standby cluster fails, if the maximum protection mode is enabled, then the production database needs to be downgraded to maximum availability mode or maximum performance mode.

Application failover

The client or application tier automatically fails over to one or more s urviving RAC instances when an instance or node failure occurs at the primary site.

See Also: "Recommendations for Fast Application Failover"

Recovery Solutions for Data Failures

When data failure occurs due to media corruption or media damage, you can use different recovery options that use the flash reco very area or switch over to the standby database. Alternatively, you can rebuild tables by fast imports or rebuild indexes in paralle l.

Recovering from User Error with Flashback Technology

When a user error causes transactional or logical data inconsistencies, you can resolve these problems by using flashback erro r correction technology at the row, transaction, table, or database levels.

RAC Rolling Upgrade

With RAC, you can apply patches for specific customer issues incrementally to one node or instance at a time, which enables continual application and database availability.

Upgrade with Logical Standby Dat abase

Data Guard enables you to upgrade the database on the sta ndby database and perform a switchover, which minimizes the overall scheduled outage time for applying patch sets or software upgrade s for the database.

Online Object Reorganization

Many scheduled outages related to the data server involve some reorganization of the database objects. They need to be accomplished w ith continued availability of the database. Oracle online object reorganization is used to manage the scheduled outages.

Complete or Partial Site Failover

This section describes the fo llowing client failover scenarios:

In the complete site failover scenario, existing connections fail and new connections are routed to a secondary or failover site. This occurs when t here is a true disaster and where the application stack is replicated.

In the partial site failover scenario, the primary site is intact, and the middle-tier applications need to be redirected after the database has failed o ver or switched over to a standby database on the secondary site. This configuration is not recommended if performance decreases sign ificantly because of the greater latency between the application servers and the database.

Complete Site Failover

< a name="1009379">

A wide-area traffic manager is implemented on the primary and secondary sites to provide the sit e failover function. The wide-area traffic manager can redirect traffic automatically if the primary site or a specific application o n the primary site is not accessible. It can also be triggered manually to switch to the secondary site for switchovers. Traffic is d irected to the secondary site only when the primary site cannot provide service due to an outage or after a switchover. If the primar y site fails, then user traffic is directed to the secondary site.

Figure 10-1 illustrates the network routes before site failover. Client requests enter the client tier of the primary site and travel by the WAN traffic manager. Client requests are sent through the firewall into the demilitarized zone (DMZ) to the ap plication server tier. Requests are then forwarded through the active load balancer to the application servers. They are then sent th rough another firewall and into the database server tier. The application requests, if required, are routed to a RAC instance. Respon ses are sent back to the application and clients by a similar path.

Figure 10-1 Network Routes Before Site Failover

Text description of the illustration maxav013. gif

Figure 10-2 illustrates the network routes a fter site failover. Client or application requests enter the secondary site at the client tier and follow exactly the same path on th e secondary site that they followed on the primary site.

Figure 10-2 Network Routes After Site Failover

Text des
cription of maxav028.gif follows

Text description of the illustration maxav028.gif

The following steps describe what happens to network traffic during a failover or switchover.

  1. The administrator has failed over or switched o ver the production database to the secondary site.
  2. The administrator st arts the middle-tier application servers on the secondary site.
  3. Typical ly, a Domain Name Service (DNS) administrator changes the wide-area traffic manager selection of the secondary site. Alternatively, t he selection can be made automatically for an entire site failure.The wide-area traffic manager at the secondary site returns the vir tual IP address of a load balancer at the secondary site. In this scenario, the site failover is accomplished by a DNS failover. The following is an example of a manual DNS failover:
    1. Change the DNS to point to the secondary site load balancer.
    2. Set TTL (Time to Li ve) to a short interval for the DNS propagation.
    3. Disable DNS on the primary site.
    4. Execute a DNS "push" to point to the secondary site.
    5. Wait until all failover operations are complete.
    6. Change TTL back to its normal setting on the DNS server.
    7. The secondary site load balance r directs traffic to the secondary site middle-tier application server.
    8. The secon dary site is ready to take client requests.

Failover also depends on the cl ient's web browser. Most browser applications cache the DNS entry for a period of time. Consequently, sessions in progress during an outage may not fail over until the cache timeout expires. The only way to resume service to such clients is to close the browser and restart it.

Partial Site Failover: Middle-Tier Applications Connect to a Remote Database Server

< /a>

This usually occurs after the database has been failed over or switched over to the secondary site and the middle- tier applications remain on the primary site. The following steps describe what happens to network traffic during a partial site fail over:

  1. The production database is failed over o r switched over to the secondary site.
  2. The middle-tier application serv ers reconnect to the database on the secondary site using configuration best practices described in "R ecommendations for Fast Application Failover".

Figure 10-3 shows the network routes after partial site failover. Client and application requ ests enter the primary site at the client tier and follow the same path to the database server tier as in Figure 10-1. When the requests enter the database server tier, they are routed to the database tier of the secondary site through any additional switches, routers, and possible firewalls.

Figure 10-3 Network Routes After Partial Site Failover

Text description of maxav031.gif follows

Text description of the illustration max av031.gif

Database Failover

Failover is the operation of taking the produ ction database offline on one site and bringing one of the standby databases online as the new production database. A failover operat ion can be invoked when an unplanned catastrophic failure occurs on the production database, and there is no possibility of recoverin g the production database in a timely manner.

Data Guard enables you to fail over by issuin g the SQL statements described in subsequent sections, by using Oracle Enterprise Manager, or by using the Oracle Data Guard broker c ommand-line interface.

See Also:

Oracle Data Guard Broker for information about using Enterprise Manager or the Data Guard broker command-line for database failover

Data Guard failover is a series of steps to convert a sta ndby database into a production database. The standby database essentially assumes the role of production. A Data Guard failover is a ccompanied by a site failover to fail over the users to the new site and database. After the failover, the secondary site contains th e production database. The former production database needs to be re-created as a new standby database to restore resiliency. The sta ndby database can be quickly re-created by using Flashback Database. See "Restoring the Standby Databas e After a Failover".

During a failover operation, lit tle or no data loss may be experienced. The complete description of a failover can be found in Oracle Data Guard Concepts and Administration.

The rest of this section includes the following topics:

When to Use Data Guard Failover

Data Guard failover should be used only in the case of an emergency and should be initiated due to an unplanned outage such as:

< p class="BP">A failover requires that the initial production database be re-created as a standby database to restore fault tolerance to your environment. The standby database can be quickly re-created by using Flashback Database. See "R estoring the Standby Database After a Failover".

When Not to Use Data Guard Failover

Do not use Data Guard failover when the problem can be fixed locally in a timely manner or when Data Guard switchover can be used. For failover with complete recovery scenarios, either the production database is not acce ssible or cannot be restarted. Data Guard failover should not be used where object recovery or flashback technology solutions provide a faster and more efficient alternative.

Data Guard Failover Using SQL*Plus

This section includes the following topics:

Physical Standby Failover Using SQL*Plus

  1. Check for archive gaps.
    
      SELECT THREAD#, LOW_SEQUENCE#, HIGH_SEQUENCE# 
        FROM V$ARCHIVE_GAP;
    
    See Also:

    Oracle Data Guard Concepts and Administration f or more information about what to do if a gap exists

  2. Shut down other standby instances.
  3. Finish recovery.
        ALTER DATABASE RECOVER MANAGED STANDBY DATABASE FINISH;
    
    
  4. Check the database state. Query switchover status readiness by executing the follow ing statement:
        SELECT SWITCHOVER_STATUS FROM V$DATABASE;
    
    
  5. Convert the physical standby database to the production role.
        ALTER DATABASE COMMIT TO SWITCHOVER TO PRIMARY;
    
    
  6. Restart instance.

Logical Standby Failover Using SQL*Plus

  1. Stop the current SQL Apply session.
        ALTER DATABASE STOP LOGICAL STANDBY APPLY;
    
    
  2. < a name="1006759">If there are additional logs to be registered, (for example, you can get to the primary database or you are usin g LGWR to the standby destination), then register the log files.
        ALTER DATABASE REGISTER L
    OGICAL LOGFILE 'file_name';
    
    
  3. Start the SQL Apply session using the NODELAY and FINISH clauses.
        ALTER DATABASE START LOGICAL STANDBY APPLY NODELAY FINISH;
    
    
  4. Activate the logical standby database.
        ALTER DATABASE ACT
    IVATE LOGICAL STANDBY DATABASE;
    
    

Failover has completed, and the new production database is available to process transactions.

Database Switchover

A database switchover performed by Oracle Data Guard is a planned transition that includes a series of steps to switch roles between a standby database and a production database. Thus, following a successful switchover operation, the standby database a ssumes the production role and the production database becomes a standby database. In a RAC environment, a switchover requires that o nly one instance is active for each database, production and standby. At times the term "switchback" is also used within the scope of database role management. A switchback operation is a subsequent switchover operation to return the roles to their original state.

Data Guard enables you to change these roles dynamically by issuing the SQL statements descr ibed in subsequent sections, or by using Oracle Enterprise Manager, or by using the Oracle Data Guard broker command-line interface. Using Oracle Enterprise Manager or the Oracle Data Guard broker command-line interface is described in Oracle Data Guard Broker.

Th is section includes the following topics:

When to Use Data Guard Switchover

Sw itchover is a planned operation. Switchover is the capability to switch database roles between the production and standby databases w ithout needing to instantiate any of the databases. Switchover can occur whenever a production database is started, the target standb y database is available, and all the archived redo logs are available. It is useful in the following situations:

When Not to Use Data Guard Switchover

Switchover is not possible or practical u nder the following circumstances:

Do not use Data G uard switchover when local recovery solutions provide a faster and more efficient alternative. The complete description of a switchov er can be found in Oracle Data Guard Concepts and Administration.

Data Guard Switchover Using SQL*Plus

If you are not using Oracle Enterprise Manager, then the high-level steps in this section can be executed with SQL*Plus. These steps are described in detail in Oracle Data Guard Concepts and Admin istration.

This section includes the following topics:

Physical Standby Switchover Using SQL*Plus

  1. < /a>Shut down all production and standby instances except one for each site.
  2. Stop active sessions on remaining active production instance

    To identify active se ssions, execute the following query:

        SELECT SID, PROCESS, PROGRAM 
        FROM V$SESSION 
        WHERE TYPE = 'USER' 
        AND SID <> (SELECT DISTINCT SID
    FROM V$MYSTAT);
    
    
  3. Check that the switchove r status on the production database is 'TO STANDBY'.
        SELECT SWITCHOVER_STATUS
     FROM V$DATABASE;
    
    
  4. Switch over the curren t production database to the standby database.
        ALTER DATABASE COMMIT TO SWITCHOVER TO STAN
    DBY [WITH SESSION SHUTDOWN];
    
    
  5. Start the n ew standby database.
        STARTUP MOUNT;
        RECOVER MANAGED STANDBY DATA
    BASE USING CURRENT LOGFILE DISCONNECT;
    
    
  6. C onvert the former standby database to a production database.
        ALTER DATABASE COMMIT TO SWIT
    CHOVER TO PRIMARY [WITH SESSION SHUTDOWN];
    
    
  7. < /a>Restart all instances.

Logical Standby Switchover Using SQL*Plus

  1. Prepare the production database to become the logical standby database.
        ALTER DATABASE PREPARE TO SWITCHOVER TO LOGICAL STANDBY;
    
    
  2. Prepare the logical standby database to become the production database.

    Following this step, logs start to ship in both directions, although the current production database does n ot process the logs coming from the current logical standby database.

        ALTER DATABASE PR
    EPARE TO SWITCHOVER TO PRIMARY;
    
    
  3. Commit t he production database to become the logical standby database.

    This is the phase where current transactions on the production database are cancelled. All DML-related cursors are invalidated, preventing new records from b eing applied. The end of redo (EOR) marker is recorded in the online redo log and then shipped (immediately if using real-time apply) to the logical standby database and registered.

        ALTER DATABASE COMMIT TO SWITCHOVER TO
     LOGICAL STANDBY;
    
    
  4. Commit the logical sta ndby database to become the production database.
        ALTER DATABASE COMMIT TO SWITCHOVER TO PR
    IMARY;
    
    
  5. Start the logical standby apply e ngine on the new logical standby database.

    If real-time apply is required, execute the following statement:

        ALTER DATABASE START LOGICAL STANDBY APPLY IMMEDIATE;
    
    

    Otherwise execute the following statement:

        ALTER DATABASE START LOGICAL STANDBY APPLY;
    

RAC Recovery

This section includes the following topics:

RAC Recovery for Unscheduled Outages

This section includes the following topics:

Automatic Instance Recovery for Failed Instances

Instance failure occurs when software or hardware problems disable an instance. After ins tance failure, Oracle automatically uses the online redo log file to perform database recovery as described in this section.

Single Node Failu re in Real Application Clusters

Instance recovery in RAC does not in clude restarting the failed instance or the recovery of applications that were running on the failed instance. Applications that were running continue by using failure recognition and recovery as described in Oracle Real Application Clusters Installation and Configuration Guide. This provides consistent an d uninterrupted service in the event of hardware or software failures. When one instance performs recovery for another instance, the surviving instance reads redo log entries generated by the failed instance and uses that information to ensure that committed transac tions are recorded in the database. Thus, data from committed transactions is not lost. The instance that is performing recovery roll s back uncommitted transactions that were active at the time of the failure and releases resources used by those transactions.

Multiple Node F ailures in Real Application Clusters

When multiple node failures occ ur, as long as one instance survives, RAC performs instance recovery for any other instances that fail. If all instances of a RAC dat abase fail, then Oracle automatically recovers the instances the next time one instance opens the database. The instance that is perf orming recovery can mount the database in either shared or exclusive mode from any node of a RAC database. This recovery procedure is the same for Oracle running in shared mode as it is for Oracle running in exclusive mode, except that one instance performs instance recovery for all the failed instances in exclusive node.

Automatic Service Relocation

Service reliability is achieved by configuring and failing over among redundant instances. More instances are enabled to pr ovide a service than would otherwise be needed. If a hardware failure occurs and adversely affects a RAC database instance, then RAC automatically moves any services on that instance to another available instance. Then Cluster Ready Services (CRS) attempts to restar t the failed nodes and instances.

An installation can specify the "preferred" and "availabl e" configuration for each service. This configuration describes the preferred way to run the system, and is used when the service fir st starts up. For example, the ERP service runs on instance1 and instance2, and the HR service runs on instance3 when the system first starts. instance2 is available to run HR in the event of a failure or planned outage, and instance3 and instance4 are available to run ERP. The service configuration can be designed several ways.

RAC recognizes when a failure affects a service and automatically fails over the service and redistributes the clients across the surviving instance supporting th e service. In parallel, CRS attempts to restart and integrate the failed instances and dependent resources back into the system. Noti fication of failures occur at various levels including notifying external parties through Enterprise Manager and callouts, recording the fault for tracking, event logging, and interrupting applications. Notification occurs from a surviving fault domain when the fail ed domain is out of service. The location and number of fault domains serving a service is transparent to the applications. Auto rest art and recovery are automatic, including all the subsystems, not just database.

< h3 class="H2">RAC Recovery for Scheduled Outages

This section includes the following topics:

Disabling CRS-Managed Resources

When an outage occurs, RAC automatically restarts essential components. Components that are eligible f or automatic restart include instances, listeners, and the database as well as several subcomponents. Some scheduled administrative t asks require that you prevent components from automatically restarting. To perform scheduled maintenance that requires a CRS-managed component to be down during the operation, the resource must be disabled to prevent CRS from trying to automatically restart the comp onent. For example, to take a node and all of its instances and services offline for maintenance purposes, disable the instance and i ts services using either Enterprise Manager or SRVCTL, and then perform the required maintenance. Otherwise, if the node fails and then restarts, then CRS attempts to restart the instance during the administrative operation.

< !--TOC=h3-"1010875"-->

Planned Service Relocation

For a scheduled outage that requires an instance, node, or other component to be isolated, RAC provides the ability to relocate, disable, and enable services. Relocation migrates the service to another instance. The sessions can also be relocated. These interfaces also allow services, instances and databases to be selectively disabled while a repair, change, or upgrade is made and re-enabled after the change is complete. This ensures that the service is not started at the instance being repaired because of a dependency or a start operation on the service. The service is disabled on the instance at the b eginning of the planned outage. It is then enabled at the end of the maintenance outage.

Fo r example, to relocate the SALES service from instance1 to instance3 in order to perform sched uled maintenance on node1,the tasks can be performed using Enterprise Manager or SRVCTL commands. The follo wing shows how to use SRVCTL commands:

  1. Relocate the SALES service to instance3.
        srvctl relocate s
    ervice -d PROD -s SALES -i instance1 -t instance3
    
    
  2. Disable the SALES service on instance1 to prevent it from being relocated to instance1 while maintenance is performed.
        srvctl disable service -d PROD -s SALES -i instance1
    
    
  3. Stop instance1.
        srvctl stop instance -d PROD -i instance1
    
    
  4. Perform the scheduled maintenance.
  5. Star t instance1.
        srvctl start instance -D PROD -i instance1
    
    
  6. Re-enable the SALES service on instance1.
        srvctl enable service -d PROD -s SALES -i instance1
    
    
    < /li>
  7. If desired, relocate the SALES service running on in stance3 back to instance1.
        srvctl relocate service -d PROD -s SALES -i
     instance3 -t instance1
    

Apply Instance Failover

This section applies to MAA, with RAC and Data Guard on each site.

A standby database can have multiple standby instances. Only one instance can have the managed recovery proc ess (MRP) or the logical standby apply process (LSP). The instance with the MRP or LSP is called the apply instance.

When you have a RAC-enabled standby database, you can fail over the apply instance of the standby RAC envir onment. Failing over to another apply instance may be necessary when incurring a planned or unplanned outage that affects the apply i nstance or node. Note the difference between apply instance failover, which utilizes multiple instances of the standby database at th e secondary site, and Data Guard failover or Data Guard switchover, which converts the standby database into a production database. T he following occurs as a result of apply instance failover:

  • St andby recovery can continue even when the apply host is undergoing maintenance or incurs a failure. By not interrupting standby recov ery, you can ensure that a Data Guard failover or switchover can be completed within the tolerated MTTR.
  • The production database will not be interrupted and production downtime can be avoided even if you are usi ng maximum protection mode, because the subsequent network connection switches to another apply instance.

For apply failover to work correctly, "Configuration Best Practices for MAA"< a href="configbp.htm#1013204"> must be followed:

W hen you follow these configuration recommendations, apply instance failover is automatic for a scheduled or unscheduled outage on the primary instance, and all standby instances have access to archived redo logs. By definition, all RAC standby instances already have access to standby redo logs because they must reside on shared storage.

The method of rest arting the physical standby managed recovery process (MRP) or the logical standby apply process (LSP) depends on whether Data Guard B roker is being used. If the Data Guard Broker is in use, then the MRP or LSP is automatically restarted on the first available standb y instance if the primary standby instance fails. If the Data Guard Broker is not being used, then the MRP or LSP must be manually re started on the new standby instance. Consider using a shared file system, such as a clustered file system or a global file system, fo r the archived redo logs. A shared file system enables you to avoid reshipment of any unapplied archived redo logs that were already shipped to the standby.

See Also:

Oracle Data Guard Concepts and Administration for details about setting up cross-instance archiving

Perfor ming an Apply Instance Failover Using SQL*Plus

If apply instance fai lover does not happen automatically, then follow these steps to restart your production database, if necessary, and restart MRP or LS P following an unscheduled apply instance or node outage:

< a name="1009037">

Step 1: Ensure That the Chosen Standby Instance is Mounted

From the targeted stand by instance, run the following query.

SELECT OPEN_MODE, DATABASE_ROLE FROM V$DATABASE;



 

Logical

Type of Standby Database Output for M ounted Standby Database If Not Mounted, Choose a Different Target or Open Manually

Physical

MOUNTED, PHYSICAL STANDBY

STARTUP NOMOUNT

READ WRITE, LOGICAL STANDBY

STARTUP

Step 2: Verify Oracle Net Connection to the Chosen Standby Host

    Ensure that the standby listener is started.
        % lsnrctl status listener_name
    
    
  1. Validate the Oracle Net alias from all the production hosts in the cluster.
        % tnsping standby_database_connection_service_name
    
    

If the connection cannot be made, then consult Oracle Net Services Administrator's Guide for further troubleshooting.

< /a>

Step 3: Start Recovery on the Ch osen Standby Instance

Use the following statements for a physical st andby database:

RECOVER MANAGED STANDBY DATABASE USING CURRENT LOGFILE DISCONNECT;

Use the following statements for a logical standby database:

ALTER DATABASE START LOGICAL STANDBY APPLY IMMEDIATE;

Step 4: Copy Archived Redo Logs to the New Apply Host

Optionally, copy the archived redo logs to the new apply host.

The copy is not necessary for a physical standby database. For a physical standby database, when the managed recovery process detects an archive gap, it requests the production archived redo logs to be resent automatically.

For a logical standby database, unapplied archive file names that have already been registered ar e not resent automatically to the logical standby database. These archived redo logs must be sent manaully to the same directory stru cture in the new apply host. You can identify the registered unapplied archived redo logs by executing a statement similar to the fol lowing:

SELECT LL.FILE_NAME, LL.THREAD#, LL.SEQUENCE#, LL.FIRST_CHANGE#, LL.NEXT_CHANGE#,

LP.APPLIED_SCN, LP.READ_SCN
  FROM DBA_LOGSTDBY_LOG LL, DBA_LOGSTDBY_PROGRESS LP
  WHERE
 LEAST(LP.APPLIED_SCN, LP.READ_SCN) <= LL.NEXT_CHANGE#;

< /a>

Compare the results of the statement to the contents of the STANDBY_ARCHIVE_DEST directory.

Step 5: Verify the New Configuration

  1. Verify that archived redo l ogs are being sent to the new apply host.

    Query V$ARCHIVE_STATUS and V$ARCHIVED_DEST_STATUS.

        SELECT NAME_SPACE, STATUS, TARGET, LOG_SEQUENCE ,
           
        TYPE,PROCESS, REGISTER , ERROR FROM V$ARCHIVE_DEST
        WHERE STATUS!='INACTIV
    E';
    
        SELECT * FROM V$ARCHIVE_DEST_STATUS WHERE STATUS!='INACTIVE';
    
    
    
  2. Verify that the managed recovery or logical apply on the n ew apply host is progressing.

    Issue the following queries to ensure that the sequence n umber is advancing over time.

    Use the following statements for a physical standby database :

         SELECT MAX(SEQUENCE#), THREAD# FROM V$LOG_HISTORY GROUP BY THREAD#;
         SELECT PROCESS, STATUS, THREAD#, SEQUENCE#, CLIENT_PROCESS FROM V$MANAGED_ STANDBY;
    
    

    Use the following statements for a logical standby database:

        SELECT MAX(SEQUENCE#), THREAD# FROM DBA_LOGSTDBY_LOG GROUP BY THREAD#;
        SELECT APPLIED_SCN FROM DBA_LO
    GSTDBY_PROGRESS;
    

Recovery Solutions for Data Failures

Recovering from a data failure is an unscheduled outage scenario. A data failure is usually, but not always, caused by some activity or failure that occurs outside the database, even though the problem may be evident within the database.

Data failure can affect the following types of database objects:

  • Datafiles such as application tablespaces, data dictionary or UNDO tablespace, temporary tablespace
  • Control file
  • Standby control file
  • Online redo log
  • Standby r edo log
  • Archived redo log
  • < /a>Server parameter file (SPFILE)

Data failure can be categorized as eit her datafile block corruption or media failure:

  • Datafile block corruption: A corrupt datafile block can be accessed but the contents within the block are invalid or inconsistent.
  • Media failure: Media failure results from a physical hardware problem or user error. The syste m cannot successfully read or write to a file that is necessary to operate the database.

In all environments, you can resolve a data failure outage by one of the following methods:

  • RMAN datafile media restoration and recovery
  • RMAN block media restoration and recovery
  • Manually re-create the object
  • < /ul>

    In a Data Guard environment, you can also use a Data Guard switchover or failover to a sta ndby database to recover from data failures.

    Another category of related outages that resul t in database objects becoming unavailable or inconsistent are caused by user error, such as dropping a table or erroneously updating table data. Information about recovering from user error can be found in "Recovering from User Error w ith Flashback Technology".

    The rest of this section i ncludes the following topics:

    Detecting and Recovering From Datafile Block Corruption

    A corrupt datafile block can be accessed, but the contents within the block are invalid or inconsistent. The typi cal cause of datafile corruption is a faulty hardware or software component in the I/O stack, which includes, but is not limited to, the file system, volume manager, device driver, host bus adapter, storage controller, and disk drive.

    The database usually remains available when corrupt blocks have been detected, but some corrupt blocks may cause widesprea d problems, such as corruption in a file header or with a data dictionary object, or corruption in a critical table that renders an a pplication unusable.

    The rest of this section includes the following topics:

    Detecting Datafile Block Corruption

    A data fault is detected when it is recognized by the user, administrator, RMAN backup, or application becaus e it has affected the availability of the application. For example:

    • A single corrupt data block in a user table that cannot be read by the application because of a bad spot of the physical disk
    • A database that automatically shuts down because of the invalid blocks of a dat afile in the SYSTEM tablespace caused by a failing disk controller

    Regul arly monitor application logs (which may be distributed across the data server, middle-tier and the client machines), the alert log, and Oracle trace files for errors such as ORA-1578 and ORA-1110

    ORA-01578: ORACLE data bloc
    k corrupted (file # 4, block # 26)
    ORA-01110: data file 4: '/u01/oradata/objrs/obj_corr.dbf'
    

    Recovering From Datafile Block Corruption

    After you have identified datafile block corruptio n, follow these steps:

    1. Determine the Extent of the Corruption Problem
    2. Replace or Move Away From Faulty Hardware
    3. Determine Which Objects Are Affected
    4. Decide Which Recovery Method to Use
    Determine the Extent of the Corruption Problem

    Use the following methods to determine the extent of the corruption:

    • Gather details from error messages

      Gather the file number, file name, and block number from the error messages. For example:

          ORA-
      01578: ORACLE data block corrupted (file # 22, block # 12698)
          ORA-01110: data file 22: '/oradata/SALES/user
      s01.dbf'
      
      

      The file number is 22, the block number is 12698, and t he file name is /oradata/SALES/users01.dbf.

    • Check log files f or additional error messages

      Record additional error messages that appear in the alert log, Oracle trace files, or application logs. Note that log files may be distributed across the data server, middle tier, and client machines.

    • Use Oracle utilities to check for additional corruption

      Use Oracle detection tools to find other data failure problems that may exist on the same disk or set of disks that have not yet been reported. For example, if the file number 22 has corrupt blocks, then it is prudent to run the RMAN BACKUP VALIDATE DATAFILE 22 command to detect additional corruption. Table 10-2 summarizes the Oracle tools that are available to detect datafile block corruption.

      Table 10-2 Oracle Tools for Detect ing Datafile Block Corruption 
      Oracle Tool Description Location of Additional Information

      RMAN BACKUP or RESTORE c ommand with VALIDATE option

      RMAN scans the specified f iles and verifies content-checking for physical and logical errors but does not actually perform the backup or recovery operation. Or acle records the address of the corrupt block and the type of corruption in the control file. Access these records through the V$DATABASE_BLOCK_CORRUPTION view, which can be used by RMAN block media recovery.

      If BLOCK CHANGE TRACKING is enabled, then do not use the INCREMENTAL LEVEL option with BACKUP VALIDATE to ensure that all data blocks are read and verified.

      To detec t all types of corruption that are possible to detect:

      • Do not specify the MAXCORRUPT option
      • Do not specify the NOCHECKSUM option
      • Do specify the CHECK LOGICAL option

      Oracle Database Backup and Recovery Advanced User's Guide

      < code>DBVERIFY utility

      External command-line utility that per forms physical structure data integrity check on a datafile or a specific segment within a datafile. Output of DBVERIFY goes to the screen or the specified LOGFILE.

      Oracle Database Utilities

      ANALYZE SQL statement with VALIDATE STRUCTURE option

      Verifies the integrity of t he structure of an index, table, or cluster; identifies corruption; and checks or verifies that your tables and indexes are consisten t. Problems are reported into the user session trace file in USER_DUMP_DEST.

      Oracle Database Administra tor's Guide

      DBMS_REPAIR PL/SQL package

      Performs block checkin g for a specified table, partition, or index. DBMS_REPAIR should be used with care and under the guidance of Oracle Supp ort.

      Oracle Database Administrator's Guide

    < !--TOC=h4-"1010164"-->
    Replace or Move Away From Faulty Hard ware

    Some corruption problems are caused by faulty hardware. If ther e is a hardware fault or a suspect component, then it is sensible to either repair the problem, or make disk space available on a sep arate disk subsystem before proceeding with a recovery option.

    If there are multiple errors , if there are operating system-level errors against the affected file, or if the errors are transient and keep moving about, then th ere is little point in proceeding until the underlying problem has been addressed or space is available on alternative disks. Ask you r hardware vendor to verify system integrity.

    In a Data Guard environment, a switchover can be performed to bring up the application and restore service quickly while the corruption problem is handled offline.

    Determine Which Objects Are Affected

    Using the file ID (fid) and bloc k ID (bid) gathered from error messages and the output from Oracle block checking utilities, determine which da tabase objects are affected by the corruption by using a query similar to the following:

    SE
    LECT tablespace_name, partition_name, segment_type,
           owner, segment_name
    FROM dba_e
    xtents 
    WHERE file_id = fid
      AND bid BET
    WEEN block_id AND block_id + blocks - 1;
    
    

    The following is an exam ple of the query and its resulting output:

    SQL> select tablespace_name, partition_name,
    segment_type,
      2  owner, segment_name from dba_extents 
      3  where file_id=4 and 11 betw
    een block_id and block_id + blocks -1;
    
    TABLESPACE_NAME   PARTITION_NAME     SEGMENT_TYPE
        OWNER   SEGMENT_NAME
    ---------------   --------------     ------------    -----   ------------
    USERS                                TABLE           SCOTT   EMP
    
    

    The integrity of a table or index can be determined by using the ANALYZE statement.

    Decide Which Recovery Method to Use

    The recommended recovery methods are summarized in Table 10-3 and Table 10-4. The recovery methods depend on whether Data Guard is being used.

    Table 10-3 summarizes recovery methods for data failure when Data Guard is not used.

    Table 10-3 Recovering From Data Failure Without Data Guard  

    Table 10-4 summarizes recovery methods for data fai lure when Data Guard is present.

    Table 10-4 Recovering from Data Failure With Data Guard  
    < table summary="" class="HRuleFormalWide" border="1" frame="HSIDES" rules="ROWS" width="100%" cellpadding="3" cellspacing="0" dir="ltr " title="">

    Widespread or unknown

    Object Affected Extent of Problem < strong>Action

    Data dictionary or UNDO segment

    N/A

    Use RMAN Datafile Media Recov ery

    Applicat ion segment (user table, index, cluster)

    Widespread or unknown

    < /td>

    Use RMAN Datafile Media Recovery

    or

    Re-Create Objects Manu ally

    Applica tion segment (user table, index, cluster)

    Localized

    Use RMAN Block Media Recovery

    or

    Re-Create Objects Manually

    TEMPORARY segment or temporary table

    N/A

    No impact to permanent objects. Re-create temporary tablespace if required.

    Object Affected Extent of Problem Impact to Applica tion Cost of Local Recovery Action

    Data dictionary or UNDO segment

    N/A

    N/A

    N/A

    Use Data Guard to Recover From Data Failure

    Application segment (user table, index, cluster)

    Low

    Low

    Use RMAN Datafile Media Recovery

    or

    Re-Create Objects Manually

    Application segment (user table, index, cluster)

    N/A

    High

    N/A

    Use Data Guard to Recover From Data Failure

    Application segment (user table, index, cluster)

    N/A

    N/A

    High

    Use Data Guard to Recover From Data Failure

    Application segment (user table, index, cluster)

    Localized

    Low

    Low

    Use RMAN Block Media Recovery

    or

    Re-Create Objects Manually

    TEMPORARY segment or temporary table

    N/A

    N/A

    N/A

    No impact to permanen t objects. Re-create temporary tablespace if required.

    The proper recovery method to use depends on the following criteria, as indicated in Table 10-3 and Table 10-4:

    • Object affected : The recovery actions available depend on which objects are affected by the corruption. Possible values are:
      • Data dictionary or UNDO segment: The object affected is owned by SY S or is part of UNDO tablespace
      • Temporary segment: Corru ptions within a temporary segment or object within a temporary tablespace do not affect permanent objects
      • Application segment: Table, index, or cluster used by the application
    • Impact to application

      An object may be critical for the application to function. This includes objects that are critical for the performance and usability of the application. It could also be a history, reporting, or logging table, which may not be as critical. It could also be an object that is no longer in use or a te mporary segment.

      This criterion only applies to a Data Guard environment and should be use d to decide between recovering the affected object locally and using Data Guard failover. Possible values are:

        < li class="LD2" type="CIRCLE">High: Object is critical to the application such that available or performance suf fers significantly
      • Low: Object is not critical and has little or no impact t o the application
    • Cost of local recovery

      < /p>

      This criterion only applies to a Data Guard environment and should be used to decide between recovering the affect ed object locally and using Data Guard failover. This is not a business cost which is assumed to be implicit in deciding how critical an object is to the application but cost in terms of feasibility of recovery, resources required and their impact on performance and total time taken.

      Cost of local recovery should include the time to restore and recover t he object from a valid source; the time to recover other dependent objects like indexes, constraints, and related tables and its inde xes and constraints; availability of resources like disk space, data or index tablespace, temporary tablespace; and impact on perform ance and functionality of current normal application functions due to absence of the corrupt object.

    • Extent of corruption

      Corruption may be localized so that it affects a known number of blocks within one or a few objects, or it may be widespread so that it affects a large portion of an object .

    Recovering From Media Failure

    When media failure occurs, follow t hese steps:

    1. Dete rmine the Extent of the Media Failure
    2. Replace or Move Away From Faulty Hardware
    3. Decide Which Recovery Action to Take
    Determine the Extent of the Media Failure

    Use the following methods to determine the extent of the media failure:

    • Gather details from error messages

      Gather the file number a nd file name from the error messages reported. Typical error messages are ORA-1157, ORA-1110 and ORA-1115, ORA-1110.

      For example, from the following error message:

          ORA-01115: IO
       error reading block from file 22 (block # 12698)
          ORA-01110: data file 22: '/oradata/SALES/users01.dbf'
      
      
    • Check log files for additional error messages

      Record additional error messages that appear in the system logs, volume manager logs, alert log, Oracle trace files, or application logs. Note that log files may be distributed across the data server, middle tier, and the cl ient machines.

    Replace or Move Away From Faulty Hardware

    If there i s a hardware fault or a suspect component, then it is sensible to either repair the problem or make disk space available on a separat e disk subsystem before proceeding with a recovery option.

    If there are multiple errors, if there are operating system-level errors against the affected file, or if the errors are transient and keep moving about, then there is little point in proceeding until the underlying problem has been addressed or space is available on alternative disks. Ask your ha rdware vendor to verify system integrity.

    Decide Which Recovery Action to Take

    The appropriate recovery action depends on what type of file is affected by the media failure. Table& nbsp;10-5 shows the type of file and the appropriate recovery.

    Table 10-5 Recovery Actions for Failure of Different Typ es of Files  

    Archived redo log file

    Type of File Recovery Acti on

    Datafile

    Media failure of a datafile is resolved in the same manner in which widespread datafile block corruption is handled.

    See Also: "Recovering From Datafile Block Corruption"

    Control file

    Loss of a control file causes the primary database to shut down. The steps to recover from control file failure include making a copy of a good control file, restoring a backup copy of the control file, or manually creating the cont rol file with the CREATE CONTROLFILE statement. The proper recovery method depends on the following:

    • Whether all current control files were lost or just a member of a multiplexed con trol file
    • Whether or not a backup control file is available

    See Also: "Performing User-Managed Flashback and Recovery" in Oracle Database Backup and Recovery Advanced User's Guide

    Standby c ontrol file

    Loss of a standby control file causes the standby datab ase to shut down. It may also, depending on the primary database protection mode, cause the primary database to shut down. To recover from a standby control file failure, a new standby control file must be created from the primary database and transferred to the sta ndby system.

    See Also: "Creating a Physical Standby Database" in Oracle Data Guard Concepts and Administration< /a>

    Online redo log file

    If a media failure has affected the online redo logs of a database, then the appropriate recovery procedure depends on the following:

    • The configuration of the online redo log: mirrored or non-mirrored
    • The type of media failure: temporary or permanent
    • The status of the on line redo log files affected by the media failure: current, active, unarchived, or inactive

    See Also: "Advanced User-Managed Recovery Scenarios" in Oracle Database Backup and Recovery Advanced User's Guide

    If the online redo log failure causes the primary database to shut down and incomplete recovery must be used to ma ke the database operational again, then Flashback Database can be used instead of restoring all datafiles. Use Flashback Database to take the database back to an SCN before the SCN of the lost online redo log group. The resetlogs operation that is done as part of th e Flashback Database procedure reinitializes all online redo log files. Using Flashback Database is faster than restoring all datafil es.

    If the online redo log failure causes the primary database to shut down in a Data Guard environment, it may be desirable to perform a Data Guard failover to reduce the time it takes to restore service to users and to redu ce the amount of data loss incurred (when using the proper database protection mode). The decision to perform a failover (instead of recovering locally at the primary site with Flashback Database, for example) depends on the estimated time to recover from the outage at the primary site, the expected amount of data loss, and the impact the recovery procedures taken at the primary site may have on the standby database.

    For example, if the decision is to recover at the primary site, then t he recovery steps may require a Flashback Database and open resetlogs, which may incur a full redo log file of lost data. A standby d atabase will have less data loss in most cases than recovering at the primary site because all redo data is available to the standby database. If recovery is done at the primary site and the standby database is ahead of the point to which the primary database is rec overed, then the standby database must be re-created or flashed back to a point before the resetlogs SCN on the primary database.

    See Also: "Creating a Physical Standby Database" in Oracle Data Guard Concepts and Administration

    Standby redo log file

    Standby redo log failure affects only the standby database in a Data Guard environment. Most standby redo log failures are handled automatically by the standby database without affecting the primary da tabase. However, if a standby redo log file fails while being archived to, then the primary database treats it as a log archive desti nation failure.

    See Also: "Dete rmine the Data Protection Mode"

    Loss of an archived redo log does not affect availability of the primary database directly, but it may significantly affec t availability if another media failure occurs before the next scheduled backup, or if Data Guard is being used and the archived redo log had not been fully received by the standby system and applied to the standby database before losing the file.

    See Also: "Advanced User-Managed Recovery Scenarios" in Oracle Database Backup and Recovery Advanced User's Guide

    If an archived redo log is lost in a Data Guard environment and the log has already been applied t o the standby database, then there is no impact. If there is no valid backup copy of the lost file, then a backup should be taken imm ediately of either the primary or standby database because the lost log will be unavailable for media recovery that may be required f or some other outage.

    If the lost archived redo log has not yet been applied to the standby database, then a backup copy of the file must be restored and made available to the standby database. If there is no valid backup cop y of the lost archived redo log, then the standby database must be reinstantiated from a backup of the primary database taken after t he NEXT_CHANGE# of the lost log (see V$ARCHIVED_LOG).

    Server parameter file (SPFILE)

    Loss of the server parameter file does not affect availability of the database. SPFILE is necessary for database startup. With the flash recovery and RMAN CONTROLFILE AUTOBACKUP features enabled, res toring a server parameter file from backup is a fast operation.

    See Als o: "Performing Recovery" of Oracle Datab ase Backup and Recovery Basics

    Oracle Cluster Registry (OCR)

    Loss of the O racle Cluster Registry file affects the availability of RAC and Cluster Ready Services. The OCR file can be restored from a physical backup that is automatically created or from an export file that is manually created by using the ocrconfig tool.

    See Also: "Administering Storage in Real Application Clusters" in Oracle Real Application Clusters Administrator's Guide

    Recovery Methods for Data Failures

    The fo llowing recovery methods can be used in all environments:

    Always use local recovery methods when Data Guard is not being used. Local recovery methods may also be appropriate in a Data Guard environment. This section also includes the following topic:

    Use RMAN Datafile Media Recovery

    Datafile media recovery recovers an entire datafile or set of datafiles for a database by using the RMAN RECOVER command. When a large or unknown number of data blocks are marked media-corrupt and require media recovery, or when an entire file is lost, the affected da tafiles must be restored and recovered.

    Use RMAN file media recovery when the following con ditions are true:

    • The number of blocks requiring recovery is l arge or unknown
    • Block media recovery is not available (for example, if incompl ete recovery is required, or if only incremental backups are available for the datafile requiring recovery)

      See Also:

      "Advanced User-Managed Recovery Scenarios" in Oracle Database Backup and Recovery Advanced User's Guide

    Use RMAN Block Media Recovery

    Block media recovery (BMR) recovers one or a set of data blocks marked "medi a corrupt" within a datafile by using the RMAN BLOCKRECOVER command. When a small number of data blocks are marked media corrupt and require media recovery, you can selectively restore and recover damaged blocks rather than whole datafiles. This results in lower mean time to recovery (MTTR) because only blocks that need recovery are restored and only necessary corrupt blocks undergo recovery. Block media recovery minimizes redo application time and avoids I/O overhead during recovery. It also enables affected data files to remain online during recovery of the corrupt blocks. The corrupt blocks, however, remain unavailable until they are complete ly recovered.

    Use block media recovery when:

    • A small number of blocks require media recovery and the blocks that need recovery are known. If a significan t portion of the datafile is corrupt, or if the amount of corruption is unknown, then a different recovery method should be used.
    • Blocks are marked corrupt (verified with the RMAN BACKUP VALIDATE co mmand) and only when complete recovery is required.

    Block media recovery cannot be us ed to recover from the following:

    • User error or software bugs that cause logical corruption where the data blocks are intact. See "Recovering from User Error with Fl ashback Technology" for additional details for this type of recovery.
    • Changes caused by corrupt redo data. Block media recovery requires that all available redo data be appl ied to the blocks being recovered.

    The following are useful practices when using bloc k media recovery:

    • The flash recovery area should have a retent ion policy such that backups of all datafiles are retained on disk longer than the frequency of use of an Oracle tool that detects bl ock corruption. For example, if RMAN BACKUP VALIDATE is run once a week, then the flash recovery area should have a rete ntion policy greater than one week. This ensures that corrupt blocks can be recovered quickly using block media recovery with disk-ba sed backups.
    • Even if archived redo logs have missing redo data, block media re covery can still work if the block has been renewed since the data file backup used for the restoration of the block, or if there are no changes for the specific block in the missing redo data.
    • If RMAN block val idation has been run proactively, then the V$DATABASE_BLOCK_CORRUPTION view has a list of blocks validated as corrupt by RMAN. RMAN can be instructed to recover all corrupt blocks listed in V$DATABASE_BLOCK_CORRUPTION using block media reco very.

    Re-Create Objects Manually

    Some database objects, such as small look-up tables or indexes, can be recovered quickly by ma nually re-creating the object instead of doing media recovery.

    Use manual object re-creatio n when:

    • You need to re-create a small index because of media c orruption. Creating an index online enables the base object to be used concurrently.
    • You need to re-create a look-up table or when the scripts to re-create the table are readily available. Dropping and re-creati ng the table may be the fastest option.

    Use Data Guard to Recover From Data Failure

    < p class="BP">Failover is the operation of taking the production database offline on one site and bringing one of the standby database s online as the new production database. A database switchover is a planned transition in which a standby database and a production d atabase switch roles.

    Use Data Guard switchover or failover for data failure when:

    • The database is down or when the database is up but the application is unavailable because of data failure, and the time to restore and recover locally is long or unknown.
    • The business cost of a switchover or failover and re-creating the standby database is less than the cost of t he expected downtime or reduced capacity while local recovery steps are taken.
    • The proper decision to recover locally is unclear or too complex, and the data failure is affecting the availability of the applicat ion.

    Recovering from User Er ror with Flashback Technology

    Oracle flashback technology revolution izes data recovery. In the past it took seconds to damage a database but hours to days to recover it. With flashback technology, the time to correct errors can be as short as the time it took to make the error. Fixing user errors that require rewinding the database, table, transaction, or row level changes to a previous point in time is easy and does not require any database or object restoration . Flashback technology provides fine-grained analysis and repair for localized damage such as erroneous row deletion. Flashback techn ology also enables correction of more widespread damage such as accidentally running the wrong application batch job. Furthermore, fl ashback technology is exponentially faster than a database restoration.

    Flashback technolog ies are applicable only to repairing the following user errors:

    Flashback technologies cannot be used for media or da ta corruption such as block corruption, bad disks, or file deletions. See "Recovery Solutions for Data Failures" and "Database Failover" to repair these outages.

    Table 10-6 summarize s the flashback solutions for each type of outage.

    Table 10-6 Flashback Solutions for Different Outages  < /strong>
    Impact of Outage Examples of User Errors< /th> Flashback Solutions
    < a name="1007487">

    Row or transaction

    See Also: "Resolving Row and Transaction Inconsistencies"

    Use a combination of:

    See Also: "Flashback Query"

    Table

    See Also:"Resolving Table Inconsistencies"

    • Dropped table
    • Erroneous transactions affecting one table or a set of tables

    Tablespace or database

    See Also : "Resolving Database-Wide Inconsistencies"

    • Erroneous batch job affecting many tables or an unknown se t of tables
    • Series of database-wide malicious transactions
    • Drop tablespace without removing the physical datafiles

    Table 10-7 summarizes e ach flashback feature.

    Table 10-7 Summary of Flashback Features  
    < tr class="Formal" align="left" valign="top">
    Flashback Feature < /a> Description

    Flashback Query

    Flashback Query enables you to view data at a point in time in the past. It can be used to view and reconstruct lost data that was deleted or changed by accident. Developers can use this feature to build s elf-service error correction into their applications, empowering end users to undo and correct their errors.

    Note: Changes are propagated to physical and logical standby databases.

    Flashback Version Query

    Flashback Version Query uses undo data stored in the database to view the changes to one or more rows along with all the metadata of the changes.

    Note: Changes are propagated to physical and logical standby databases.

    Flashback Transaction Query< /p>

    Flashback Transaction Query enables you to examine changes to the d atabase at the transaction level. As a result, you can diagnose problems, perform analysis, and audit transactions.

    Note: Changes are propagated to physical and logical standby databases.

    Flashback Drop

    Flashback Drop provides a way to restore accidentally dropped tables.

    Note: Changes are propagated to p hysical standby databases.

    Flashback Table

    Flashb ack Table enables you to quickly recover a table to a point in time in the past without restoring a backup.

    Note: Changes are propagated to physical and logical standby databases.

    F lashback Database

    Flashback Database enables you to quickly ret urn the database to an earlier point in time by undoing all of the changes that have taken place since that time. This operation is f ast because you do not need to restore the backups.

    Flashback Dat abase uses the Oracle Database flashback logs, while all other features of flashback technology use the Oracle Database unique undo a nd multiversion read consistency capabilities. See "Configuration Best Practices for the Database" for configuring flashback technologies to ensure that the resources from these solutions are avail able at a time of failure.

    The rest of this section includes the following topics:

    < !--TOC=h2-"1007226"-->

    Resolving Row and Transaction Inconsi stencies

    Resolving row and transaction inconsistencies may require a combination of Flashback Query, Flashback Version Query, Flashback Transaction Query, and the suggested undo statements to rectify t he problem. The following sections describe a general approach using a human resources example to resolve row and transaction inconsi stencies caused by erroneous or malicious user errors.

    This section includes the following topics:

    Flashback Query

    Flashback Query, a feature introduced in the Oracle9i Database, enables an administrator or user to query any data at some point in time in the past. This powerful feature can be used to view and reconstruct data that may have been deleted or changed by accident. For example:

    SELECT * FROM EMPLOYEES 
           AS OF TIMESTAMP 
           TO_DATE('28-Aug-03 14:00'
    ,'DD-Mon-YY HH24:MI')
     WHERE ...
    
    

    This parti al statement displays rows from the EMPLOYEES table starting from 2 p.m. on August 28, 2003. Developers can use this fea ture to build self-service error correction into their applications, empowering end users to undo and correct their errors without de lay, rather than burdening administrators to perform this task. Flashback Query is very simple to manage, because the database automa tically keeps the necessary information to reconstruct data for a configurable time into the past.

    Flashback Version Query

    Flashback Version Query provides a way to view changes made to the database at the row level. It is an extension to SQL and enables the retrieval of all the different versions of a row across a specified time interval. F or example:

    SELECT * FROM EMPLOYEES
           VERSIONS BETWEEN TIMESTAMP
    
           TO_DATE('28-Aug-03 14:00','dd-Mon-YY hh24:mi') AND
           TO_DATE('28-Aug-03 15:0
    0','dd-Mon-YY hh24:mi')
    WHERE ...
    
    

    This stat ement displays each version of the row, each entry changed by a different transaction, between 2 and 3 p.m. today. A DBA can use this to pinpoint when and how data is changed and trace it back to the user, application, or transaction. This enables the DBA to track d own the source of a logical corruption in the database and correct it. It also enables application developers to debug their code.

    Flashback T ransaction Query

    Flashback Transaction Query provides a way to view changes made to the database at the transaction level. It is an extension to SQL that enables you to see all changes made by a transa ction. For example:

    SELECT UNDO_SQL
    FOMR DBA_TRANSACTION_QUERY 
    WHERE XID = '000200030000002D';
    
    

    This statement s hows all of the changes that resulted from this transaction. In addition, compensating SQL statements are returned and can be used to undo changes made to all rows by this transaction. Using a precision tool like this, the DBA and application developer can precisely diagnose and correct logical problems in the database or application.

    Example: Using Flashback Technology to Investigate Salary Discrepancy

    Consider a human resources (HR) example involving the SCOTT schema. the HR manager reports to the DBA that there is a potential discrepancy in Ward's salary. Sometime before 9:00 a.m., Ward 's salary was increased to $1875. The HR manager is uncertain how this occurred and wishes to know when the employee's salary was inc reased. In addition, he has instructed his staff to reset the salary to the previous level of $1250, and this was completed around 9: 15 a.m.

    The following steps show how to approach the problem.

    1. Assess the problem.

      Fortunate ly, the HR manager has provided information about the time when the change occurred. We can query the information as it was at 9:00 a .m. with Flashback Query.

          SELECT EMPNO, ENAME, SAL
          FROM EMP
      <
      a name="1012425">    AS OF TIMESTAMP TO_DATE('03-SEP-03 09:00','dd-Mon-yy hh24:mi')
          WHERE ENAME = 'WARD
      ';
      
               EMPNO ENAME             SAL
          ---------- ----------
      ----------
                7521 WARD             1875
      
      

      We can confirm we have the correct employee by the fact that Ward's salary was $1875 at 09:00 a.m. Rather than using Ward's nam e, we can now use the employee number for subsequent investigation.

    2. Query past rows or versions of the data to acquire transaction information.

      Although i t is possible to restrict the row version information to a specific date or SCN range, we decide to query all the row information tha t we have available for the employee WARD using Flashback Version Query.

          SELECT EMPNO,
      ENAME, SAL, VERSIONS_STARTTIME, VERSIONS_ENDTIME
          FROM EMP
          VERSIONS BETWEEN TIMEST
      AMP MINVALUE AND MAXVALUE
          WHERE EMPNO = 7521
          ORDER BY NVL(VERSIONS_STARTSCN,1);
      <
      a name="1012447">
              EMPNO ENAME             SAL VERSIONS_STARTTIME     VERSIONS_ENDTIME
          -------- ---------- ---------- ---------------------- ----------------------
               7521 WARD
                1250 03-SEP-03 08.48.43 AM  03-SEP-03 08.54.49 AM
               7521 WARD             1875 03-SEP-03 08.54
      .49 AM  03-SEP-03 09.10.09 AM
               7521 WARD             1250 03-SEP-03 09.10.09 AM
      
      

      We can see that WARD's salary was increased from $1250 to $1875 at 08:54:49 the same mo rning and was subsequently reset to $1250 at approximately 09:10:09.

      Also, we can modify t he query to determine the transaction information for each of the changes effecting WARD using a similar Flashback Version Query. Thi s time we use the VERSIONS_XID pseudocolumn.

          SELECT EMPNO, ENAME, SAL, VER
      SIONS_XID
          FROM EMP
          VERSIONS BETWEEN TIMESTAMP MINVALUE AND MAXVALUE
          WHERE EMPNO = 7521
          ORDER BY NVL(VERSIONS_STARTSCN,1);
      
               EMPNO ENAME             SAL VERSIONS_XID
          ---------- ---------- ---------- ----------------
                7521 WARD             1250 0006000800000086
                7521 WARD             1875
      0009000500000089
                7521 WARD             1250 000800050000008B
      
      
    3. Query the erroneous transaction and the scope of its impact.

      With the transaction information (VERSIONS_XID pseudocolumn), we can now query the database t o determine the scope of the transaction, using Flashback Transaction Query.

          SELECT UND
      O_SQL
          FROM FLASHBACK_TRANSACTION_QUERY
          WHERE XID = HEXTORAW('0009000500000089');
      
          UNDO_SQL                                                                    
          ----------------------------------------------------------------------------
          update "SCO
      TT"."EMP" set "SAL" = '950' where ROWID = 'AAACV4AAFAAAAKtAAL';      
          update "SCOTT"."EMP" set "SAL" = '150
      0' where ROWID = 'AAACV4AAFAAAAKtAAJ';     
          update "SCOTT"."EMP" set "SAL" = '2850' where ROWID = 'AAACV4AA
      FAAAAKtAAF';      
          update "SCOTT"."EMP" set "SAL" = '1250' where ROWID = 'AAACV4AAFAAAAKtAAE';    
          update "SCOTT"."EMP" set "SAL" = '1600' where ROWID = 'AAACV4AAFAAAAKtAAB';     
      
                                                                          
          6 rows selected.
      
      

      We can see that WARD's salary was not the only change that occurred in the transactio n. The information that was changed for the other four employees at the same time as WARD can now be passed back to the HR manager fo r review.

    4. Determine if the corrective statements should be executed .

      If the HR manager decides that the corrective changes suggested by the UNDO_SQL column are correct, then the DBA can execute these statements individually.

    Resolving Table Inconsistencies

    Oracle provides a FLASHBACK DROP statement to recover from an accidenta l DROP TABLE statement, and a FLASHBACK TABLE statement to restore a table to a previous point in the datab ase.

    This section includes the following topics:

    Flashback Table

    Flash back Table provides the DBA the ability to recover a table, or a set of tables, to a specified point in time quickly and easily. In m any cases, Flashback Table alleviates the need to perform more complicated point in time recovery operations. For example:

    FLASHBACK TABLE orders, order_items 
          TO TIMESTAMP 
    
          TO_DATE('29-AUG-03 14.00.00','dd-Mon-yy hh24:mi:ss');
    
    

    This statement rewinds any updates to the ORDERS and ORDER_ITEMS tables that have been done between the current time and specified timestamp in the past. Flashback Table performs this operation online and in place, and it maintains referential i ntegrity constraints between the tables.

    Flashback Drop

    Dropping or deleting database objects by accident is a common mistake. Users soon realize their mistake, but by then it is too late and there has been no way to easily recover the dropped tables and its indexes, constraints, and triggers. Objects once dropped were dropped forever. Loss of very important tables or other objects (like indexes, partitions or clusters) required DBAs to perform a point-in-time recovery, w hich can be time-consuming and lead to loss of recent transactions.

    Flashback Drop provides a safety net when dropping objects in Oracle Database 10g. When a user drops a table, Oracle places it in a recycle bin. Objects in the recycle bin remain there until the user decides to permanently remove them or until space limitations beg in to occur on the tablespace containing the table. The recycle bin is a virtual container where all dropped objects reside. Users ca n look in the recycle bin and undrop the dropped table and its dependent objects. For example, the employees table and a ll its dependent objects would be undropped by the following statement:

    FLASHBACK TABLE emp
    loyees TO BEFORE DROP;
    

    Resolving Database-Wide Inconsistencies

    Oracle prov ides Flashback Database to rewind the entire database to a previous point in time. This section includes the following topics:

  • Flashback Database
  • Using Flashback Database to Repair a Dropped Tablespace

Flashb ack Database

To bring an Oracle database to a previous point in time , the traditional method is point-in-time recovery. However, point-in-time recovery can take hours or even days, since it requires th e whole database to be restored from backup and recovered to the point in time just before the error was introduced into the database . With the size of databases constantly growing, it will take hours or even days just to restore the whole database.

Flashback Database is a new strategy for doing point-in-time recovery. It quickly rewinds an Oracle databas e to a previous time to correct any problems caused by logical data corruption or user error. Flashback logs are used to capture old versions of changed blocks. One way to think of it is as a continuous backup or storage snapshot. When recovery needs to be performed the flashback logs are quickly replayed to restore the database to a point in time before the error and just the changed blocks are restored. It is extremely fast and reduces recovery time from hours to minutes. In addition, it is easy to use. A database can be rec overed to 2:05 p.m. by issuing a single statement. Before the database can be recovered, all instances of the database must be shut d own and one of the instances subsequently mounted. The following is an example of a FLASHBACK DATABASE statement.

FLASHBACK DATABASE TO TIMESTAMP TIMESTAMP'2002-11-05 14:00:00';

No restoration from tape, no lengthy downtime, and no complicated recovery procedures are required to use it. You can also use Flashback Database and then open the database in read-only mode and examine its contents. If you determi ne that you flashed back too far or not far enough, then you can reissue the FLASHBACK DATABASE statement or continue re covery to a later time to find the proper point in time before the database was damaged. Flashback Database works with a production d atabase, a physical standby database, and a logical standby database.

These steps are recom mended for using Flashback Database:

  1. Determine the time or the SCN to which to flash back the database.
  2. Verify that t here is sufficient flashback log information.
         SELECT OLDEST_FLASHBACK_SCN, 
           TO_CHAR(OLDEST_FLASHBACK_TIME, 'mon-dd-yyyy HH:MI:SS') 
           FROM V$FLASHBACK_DATABASE_LOG;
    <
    a name="1007337">
    
  3. Flash back the database to a specific tim e or SCN. (The database must be mounted to perform a Flashback Database.)
        FLASHBACK DATABA
    SE TO SCN scn;
    
    

    or

        FLASHBACK DATABASE TO TIMESTAMP TO_DATE date;
    
    
  4. Open the database in read-only mode to verify that it is in the correct state.
        ALTER DATABASE OPEN READ ONLY;
    
    

    If more flashback data is required, then issue another FLASHBACK DATABASE statement. (The database must be mou nted to perform a Flashback Database.)

    If you want to move forward in time, issue a statem ent similar to the following:

         RECOVER DATABASE UNTIL [TIME | CHANGE] date | scn;
    
    
  5. < /a>Open the database.


    Caution:

    After you open the database, you will not be able to flash back to an SCN before the resetlogs SCN, so ensure that the database is in the correct state before issuing the A LTER DATABASE OPEN RESETLOGS statement.


        ALTER DATAB
    ASE OPEN RESETLOGS;
    
    

Other considerations when using Fla shback Database are as follows:

  • If there are not sufficient fl ashback logs, then use Data Guard if available or restore from backups.
  • After flashing back a database, any dependent database such as a standby database needs to be flashed back. See Chapter 11, "Restoring Fault Tolerance".

Using Flashback Database to Repair a Dropped Tablespace

Flashback Database does not automatically fix this problem, but it can be used to dramatically reduce the downtime. You can flash back the production database to a point before the tablespace was dropped and then restore a backu p of the corresponding datafiles from the affected tablespace and recover to a time before the tablespace was dropped.

Follow these recommended steps to use Flashback Database to repair a dropped tablespace:

  1. Determine the SCN or time you dropped the tablespace.
  2. Flash back the database to a time before the tablespace was dropped. You can us e a statement similar to the following:
        FLASHBACK DATABASE TO BEFORE SCN drop_scn;
    
    
  3. Restore, rename, and br ing datafiles online.
  1. Restore only the d atafiles from the affected tablespace from a backup.
  2. Rename the unnamed files to the backup files.
             ALTER DATABASE RENAME FILE '.../UNNAMED00005' to 'restored_file';
    
    
  3. Bring the datafiles online .
             ALTER DATABASE DATAFILE 'name' ONLINE; 
    <
    /a>         
    
  • Query and recover the d atabase.
        SELECT CHECKPOINT_CHANGE# FROM V$DATAFILE_HEADER WHERE FILE#=1;
        RECOVER DATABASE UNTIL CHANGE checkpoint_change#;
    
    
  • Open the database.
        ALTER DATABASE OPEN RESETLOG
    S;
    

    RAC Rolling Upgrade

    "One-off" patches or interim patches to database so ftware are usually applied to implement known fixes for software problems an installation has encountered or to apply diagnostic patc hes to gather information regarding a problem. Such patch application is often carried out during a scheduled maintenance outage.

    Oracle now provides the capability to do rolling patch upgrades with Real Application Clusters with little or no database downtime. The tool used to achieve this is the opatch command-line utility.

    The advantage of a RAC rolling upgrade is that it enables at least some instances of the RAC installation t o be available during the scheduled outage required for patch upgrades. Only the RAC instance that is currently being patched needs t o be brought down. The other instances can continue to remain available. This means that the impact on the application downtime requi red for such scheduled outages is further minimized. Oracle's opatch utility enables the user to apply the patch success ively to the different instances of the RAC installation.

    Rolling upgrade is available only for patches that have been certified by Oracle to be eligible for rolling upgrades. Typically, patches that can be installed in a ro lling upgrade include:

    • Patches that do not affect the contents of the database such as the data dictionary
    • Patches not related to RAC intern ode communication
    • Patches related to client-side tools such as SQL*PLUS, Oracl e utilities, development libraries, and Oracle Net
    • Patches that do not change shared database resources such as datafile headers, control files, and common header definitions of kernel modules

    Rolling upgrade of patches is currently available for one-off patches only. It is not available for pat ch sets.

    Rolling patch upgrades are not available for deployments where the Oracle Database software is shared across the different nodes. This is the case where the Oracle home is on Cluster File System (CFS) or on shared v olumes provided by file servers or NFS-mounted drives. The feature is only available where each node has its own copy of the Oracle D atabase software.

    This section includes the following topics:

    Applying a Patch wit h opatch

    The opatch utility applies a patch successivel y to nodes of the RAC cluster. The nature of the patch enables a RAC installation to run in a mixed environment. Different instances of the database may be operating at the same time, and the patch may have been applied to some instances and not others. The op atch utility automatically detects the nodes of the cluster on which a specific RAC deployment has been implemented. The patch is applied to each node, one at a time. For each node, the DBA is prompted to shut down the instance. The patch is applied to the da tabase software install on that node. After the current node has been patched, the instance can be restarted. After the patch is appl ied on the current node, the DBA is allowed to choose the next RAC node to apply the patch to. The cycle of instance shutdown, patch application, and instance startup is repeated. Thus, at any time during the patch application, only one node needs to be down.

    To check if a patch is a rolling patch, execute the following on UNIX platforms:

    opatch query -is_rolling
    
    

    (On Windows, e xecute opatch.bat.)

    Enter the patch location after the prompt.

    To apply a patch to all nodes of the RAC cluster, execute the following command:

    opatch apply patch_location
    
    

    opatch automatically recognizes the patch to be a rolling patch and provides the required behavior.

    To apply a patch to only the local node, enter the following command:

    opatch apply -local patch_location
    
    

    T o check the results of a patch application, check the logs in the following location:

    $ORAC
    LE_HOME/.patch_storage/patch_id/patch_id_Apply_timestamp.log
    

    Rollin g Back a Patch with opatch

    Patches can be rolled back with the opatch utility. This enables the DBA to remove a troublesome patch or a patch that is no longer required. This can be done as a rolling procedure.

    To roll back a patch across all nodes of a RAC cluster, execute the f ollowing command:

    opatch rollback -id patch_id -ph patch_location
    
    

    To roll back a patch on the local node onl y, enter the following command:

    opatch rollback -local -id patch_id
     -ph patch_location
    
    

    To check the results of a patch rollback, check the logs in the following location:

    $ORACLE_HOME/.patch_storage/
    patch_id/patch_id_RollBack_timestamp.log
    

    Using opatch to List Insta lled Software Components and Patches

    The opatch utility also provides an option to list the installed software components as well as the installed patches. Enter the following command:

    opatch lsinventory 
    
    

    For de tails on usage and the other options to these commands, see MetaLink Notes 242993.1 and 189489.1 at http://metalink.oracle.com.

    Recommended Practices for RAC Rolling Upgrades

    < /a>

    The following are recommended practices for all database patch upgrades:

    • Always confirm with Oracle Support Services that the patch is valid for your problem and for your deplo yment environment.
    • Have a plan for applying the patch as well as a plan for ba cking out the patch.
    • Apply the patch to your test environment first and verify that it fixes the problem.
    • When you plan the elapsed time for applying the pa tch, include time for starting up and shutting down the other tiers of your technology stack if necessary.

    The following are additional recommended practices for RAC rolling upgrades.

    • If multiple instances share an Oracle home, then all of them will be affected by application of a patch. The DBA should verify that this will not cause unintentional side effects. Also, all such instances on a node must be shut d own during the patch application. Scheduled outage planning should take this into account. As a best practice, only similar applicati ons should share an Oracle home on a node. This provides greater flexibility for patching.
    • The Oracle inventory on each node is a repository of the Oracle Database software installed on the node. The inventory i s node-specific. It is shared by all Oracle software installed on the node. It is similar across nodes only if all nodes are exactly the same in terms of the Oracle Database software deployed, the deployment configuration, and patch levels. Because the Oracle invent ory greatly aids the patch application and patch management process, it is recommended that its integrity be maintained. Oracle inven tory should be backed up after each patch installation to any Oracle software on a specific node. This applies to the Oracle inventor y on each node of the cluster.
    • Use the Oracle Universal Installer to install a ll Oracle database software. This creates the relevant repository entries in the Oracle inventory on each node of the cluster. Also, use the Oracle Universal Installer to add nodes to an existing RAC cluster.

      However, if this was not done or is not feasible for some reason, adding information about an existing Oracle database software installation to the Oracle inventory can be done with the attach option of the opatch utility. Node information can be also added with this option.

    • The nature of the rolling patch upgrade enables i t to be applied to only some nodes of the RAC cluster. So an instance can be operating with the patch applied, while another instance is operating without the patch. This is not possible for nonrolling patch upgrades. Apply nonrolling patch upgrades to all instances before the RAC deployment is activated. A mixed environment is useful if a patch needs to be tested before deploying it to all the i nstances. Applying the patch with the -local option is the recommended way to do this.

      In the interest of keeping all instances of the RAC cluster at the same patch level, it is strongly recommended that aft er a patch has been validated, it should be applied to all nodes of the RAC installation. When instances of a RAC cluster have simila r patch software, services can be migrated among instances without running into the problem a patch may have fixed.

    • All patches (including those applied by rolling upgrades) should be maintained online and n ot removed once they have been applied. This is useful if a patch needs to be rolled back or needs to be applied again.

      The patches should be stored in a location that is accessible by all nodes of the cluster. Thus all nodes of the cluster are equivalent in their capability to apply or roll back a patch.

    • Rolling patch upgrades, just like any other patch upgrade, should be done when no other patch upgrade or Oracle installa tion is being done on the node. Application of multiple patches is a sequential process. The scheduled outage should be planned accor dingly.
    • If multiple patches need to be applied and they must be applied at the same time, and if only some of these patches are eligible for rolling upgrade, apply all of them in a nonrolling manner. This reduce the overall time required to get through the patching process.
    • For patches th at are not eligible for rolling upgrade, the next best option for RAC deployments is the minimize_downtime option of the apply command.
    • Perform the rolling upgrade when system usage is low. This ensures minimal disruption of service for the end user.

    Upgrade with Logical Standby Database

    Using a logical standby database enables you to accomplish upgrades for database software and patch sets with almost no downtime.


    Note:

    This capability is not available for upgrading from Oracle9i to Oracle Database 10g.


    If a logical s tandby database does not currently exist, then verify that a logical standby database supports all of the essential datatypes of your application.

    First, create or establish a logical standby database. Figure 10-4 shows a production database and a logical standby database, which are both version X databases.

    Figure 10-4 Establish a Logica l Standby Database

    Text description of maxav023.gif follows

    Text description of the illustration maxav023.gif

    Seco nd, stop the SQL Apply process and upgrade the database software on the logical standby database to version X+1. Figure 10-5 shows the production database, version X, and the logical standby database, version X+1.

    Figure 10-5 Upgrade the Logical Standby Database Version< /em>

    Text description of maxav026.gif follows

    Text description of the illustration maxav026.gif

    Third, restart SQL Appl y and operate with version X on the production database and version X+1 on the standby database. The configuration can run in the mix ed mode shown in Figure 10-6 for an arbitrary period to validate the upgrade in the production environment.

    Figure 10-6 Running in Mixed M ode

    Text description of maxav024.gif follows

    Text description of the illustration maxav024.gif

    When you are satisf ied that the upgraded software is operating properly, you can reverse the database roles by performing a switchover. This may take on ly a few seconds. Switch the database clients to the new production database, so that the application becomes active. If application service levels degrade for some reason, then you can open the previous production database again, switch users back, and terminate th e previous steps.

    Figure 10-7 shows that the former standby database (version X+1) is now the production database, and the former production database (version X) is now the standby data base. The clients are connected to the new production database.

    Figure 10-7 After a Switchover Has Been Performed

    Text description of maxav027.gif follows

    Text description of the illustration maxav027.gif

    Upgrade the new standby database. Figure 10-8 s hows the system after both databases have been upgraded to version X+1.

    Figure 10-8 Both Databases Have Been Upgraded

    Text description of maxav025.gif follows

    Text description of the illustration maxav025 .gif

    The role reversal that was just described includes the following steps:

    1. Initiate a Data Guard switchover, making the original st andby database the production database. This can occur in a few seconds.
    2. Switch the database clients to the new production database, so that the application and service become active. If application servi ce levels degrade for some reason, then you can open the previous production database again, switch users back, and terminate the pre vious steps.
    3. Upgrade the database software on the standby database afte r you are satisfied with current operations.
    4. Raise the compatibility le vel of both databases.

    Online Object Reorganization

    Oracle's online obj ect reorganization capabilities have been available since Oracle8i. These capabilities enable object reorgani zation to be performed even while the underlying data is being modified.

    Table 10-8 describes a few of the object reorganization capabilities available with Oracle Database 10g.

    Table 10-8 Some Object Reorganization Capabilities  
    < tr class="Formal">
    Object Type Example of Object Reorganization Solution Description of Sol ution

    Table

    DBMS_REDEFINITION PL/SQL packa ge

    A PL/SQL package that provides a mechanism to redefine tables on line

    Index

    Rebuild index

    Rebuild an index that has previously been marked as unusable

    Tablespace

    Ren ame tablespace

    Enables an existing tablespace to be renamed without rebuilding the tablespace and its contents

    Online Table Reorganization

    In highly available systems, it is occasionally necessary to redefine large tables that are constantly accessed to improve the performance of queries or DML performed. Oracle provides the DBMS_REDEFINITION PL/SQL package to redefine tables online. This package provides a significant increase in availability compared to traditional methods of redefining tables tha t require tables to be taken offline.

    Online Index Reorganization

    An index can be rebuilt online using the previous index definition, optionally moving the index to a new tablespace.

    Online Tablespace Reorganization

    Oracle Database 10g introduces the ability to ren ame a tablespace, similar to the ability to rename a column, table and datafile. Previously, the only way to change a tablespace name was to drop and re-create the tablespace, but this meant that the contents of the tablespace had to be dropped and rebuilt later. Wi th the ability to rename a tablespace online, there is no interruption to the users.

    ALTER
    TABLESPACE USERS RENAME TO new_tablespace_name;
    
    Tablespace alter
    ed.