Oracle® High Availability Architecture and Best Practices
10g Release 1 (10.1)

Part Number B10726-01
8
Using Oracle Enterprise Manager for Monitoring and Detection

This chapter provides recommendations for using Oracle Enterprise Manager to monitor and maintain a highly available environment across all tiers of the application stack. In addition, it describes how to create an Enterprise Manager configuration that is highly available.

Overview of Monitoring and Detection for High Availability

Continuous monitoring of the system, network, database operations, application, and other system components ensures early detection of problems. Early detection improves the user's system experience because problems can be resolved faster. In addition, monitoring captures system metrics to indicate trends in system performance growth and recurring problems. This information can facilitate prevention, enforce security policies, and manage job processing. For the database server, a sound monitoring system needs to measure availability, detect events that can cause the database server to become unavailable, and immediately notify the responsible parties of critical failures.

The monitoring system itself needs to be highly available and adhere to the same operational best practices and availability practices as the resources it monitors. Failure of the monitoring system leaves all systems that it monitors unable to capture diagnostic data or alert the administrator of problems.

Oracle Enterprise Manager provides management and monitoring capabilities with many different notification options. This chapter provides recommendations for using Enterprise Manager to monitor and maintain a highly available environment across all tiers of the application stack. It recommends methods for monitoring the environment's availability and performance and for using the tools to respond to changes in the environment. In addition, it describes how to create an Enterprise Manager configuration that is highly available and offers additional configuration tips.

Using Enterprise Manager for System Monitoring

This section provides an overview of the concepts and facilities available in Enterprise Manager.

A major benefit of Enterprise Manager is its ability to manage components across the entire application stack from the host operating system to a user or packaged application. Enterprise Manager treats each of the layers in the application as a target. Targets, such as databases, application servers, and hardware, can then be viewed along with other targets of the same type or can be grouped together by application type. All targets can also be reviewed in a single view. Each target type has a default generated home page that displays a summary of relevant details for a specific target. Different types of targets can be grouped together by function, that is, as resources that support the same application.

Every target is monitored by an Oracle Management Agent. Every Management Agent runs on a machine and is responsible for a set of targets. The targets can be on a machine that is different from the machine that the Management Agent is on. For example, a Management Agent can monitor a storage array that cannot host an agent natively. When a Management Agent is installed on a host, the host is automatically discovered along with other targets that are on the machine.

The Grid Control home page shown in Figure 8-1 provides a picture of the availability of all of the discovered targets.

Figure 8-1 Grid Control Home Page

The Grid Control home page shows the following major kinds of information:

Alerts are generated by a combination of factors and are defined on specific metrics. A metric is a data point sampled by a Management Agent and sent to the Oracle Management Repository. It could be the availability of a component through a simple heartbeat test or an evaluation of a specific performance measurement such as "disk busy" or percentage of processes waiting for a specific wait event.

There are four states that can be checked for any metric: error, warning, critical, and clear. The administrator must make policy decisions such as:

All of these decisions are predicated on the business needs of the system. For example, all components may be monitored for availability, but some systems may be monitored only during business hours. Systems with specific performance problems can have additional performance tracing enabled to debug a problem.
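As a minimal sketch of how warning and critical thresholds could map one sampled metric value onto the four states named above, consider the following Python fragment. The function name `classify_sample` and the threshold values are illustrative assumptions and are not part of Enterprise Manager. Treating a missing sample as the error state is also an assumption of this sketch.

```python
from typing import Optional


def classify_sample(value: Optional[float], warning: float, critical: float) -> str:
    """Map one metric sample to a state (hypothetical helper, not an EM API)."""
    if value is None:
        # Assumption: a sample that could not be collected is an "error".
        return "error"
    if value >= critical:
        return "critical"
    if value >= warning:
        return "warning"
    return "clear"


if __name__ == "__main__":
    # Illustrative thresholds only; real values are a site policy decision.
    for sample in (42.0, 78.5, 93.0, None):
        print(sample, "->", classify_sample(sample, warning=70.0, critical=90.0))
```

The critical threshold is checked before the warning threshold so that a value above both is reported at the more severe level.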

The rest of this section includes the following topics:

Set Up Default Notification Rules for Each System

Notification Rules are defined sets of alerts on metrics that are automatically applied to a target when it is discovered by Enterprise Manager. For example, an administrator can create a rule that monitors the availability of database targets and generates an e-mail message if a database fails. After that rule is generated, it is applied to all existing databases and any database created in the future. Access these rules by navigating to Preferences and then choosing Rules.

The rules monitor problems that require immediate attention, such as those that can affect service availability and Oracle or application errors. Service availability can be affected by an outage in any layer of the application stack: node, database, listener, and critical application data. A service availability failure, such as the inability to connect to the database, or the inability to access data critical to the functionality of the application, must be identified, reported, and reacted to quickly. Potential service outages such as a full archive log directory also need to be addressed correctly to avoid a system outage.

Enterprise Manager provides a series of default rules that provide a strong framework for monitoring availability. A default rule is provided for each of the preinstalled target types that come with Enterprise Manager. These rules can be modified to conform to the policies of each individual site, and new rules can be created for site-specific targets or applications. The rules can also be set to notify users during specific time periods to create an automated coverage policy.

Consider the following recommendations:

Figure 8-2 shows the Notification Rule property page for choosing availability states. Down, Agent Unreachable, Agent Unreachable Resolved, and Metric Error Detected are chosen.

Figure 8-2 Setting Notification Rules for Availability

In addition, modify the metrics monitored by the database rule to report the metrics shown in Table 8-1, Table 8-2, and Table 8-3. This ensures that these metrics are captured for all database targets and that trend data will be available for future analysis. All of the events described in Table 8-1, Table 8-2, and Table 8-3 can be accessed from the Database Homepage. Choose All Metrics > Expand All.

Space management conditions that have the potential to cause a service outage should be monitored using the events shown in Table 8-1.

Table 8-1 Recommendations for Monitoring Space  
Metric Recommendation

Tablespace Space Used (%)

Set this metric to monitor root file systems for any critical hardware server. This metric enables the administrator to choose the threshold percentages that Enterprise Manager tests against, as well as the number of samples that must occur in error before a message is generated to the administrator. The recommended default settings are 70 percent for a warning and 90 percent for an error, but these should be adjusted depending on system usage. This metric can be customized to monitor only specific tablespaces.

This metric and similar events can be set in the Tablespace Full metric group.

Archiver Hung Alert Log Error

Set this metric to monitor the alert log for ORA-00257 errors, which indicate a full archive log directory.

This metric can be set in the Alert Log Error Status metric group.

Archive Area Used (%)

Set this metric with thresholds and an appropriate sampling time. This metric can alert the administrator about a full archive directory, which can stop the system. The recommended default settings are 70 percent for a warning and 90 percent for an error, but these should be adjusted depending on system usage.

This metric can be set in the Archive Area metric group.

Dump Area Used (%)

Set this metric to monitor the dump directory destinations. Dump space must be available so that the maximum amount of diagnostic information is saved the first time an error occurs. The recommended default settings are 70 percent for a warning and 90 percent for an error, but these should be adjusted depending on system usage.

This metric can be set in the Dump Area metric group.
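Several rows of Table 8-1 combine a percentage threshold with a number of samples that must occur in error before a message is generated. The following Python sketch shows one way that combination can behave. The function name, the sample series, and the rule that an alert fires once per streak are assumptions made for illustration; they do not describe the internal logic of Enterprise Manager.

```python
from typing import Iterable, List


def alerts_for_samples(samples: Iterable[float], threshold: float, occurrences: int) -> List[int]:
    """Return the sample indexes at which an alert would fire.

    An alert fires when `occurrences` consecutive samples are at or above
    `threshold`. It does not fire again until the value drops below the
    threshold and a new streak builds up.
    """
    fired = []
    streak = 0
    for index, value in enumerate(samples):
        if value >= threshold:
            streak += 1
            if streak == occurrences:
                fired.append(index)
        else:
            streak = 0  # one good sample resets the count
    return fired


if __name__ == "__main__":
    # Hypothetical "space used (%)" readings.
    used_pct = [65, 72, 68, 71, 74, 91, 60, 75, 80]
    print("warning fired at samples:", alerts_for_samples(used_pct, 70, 2))
    print("error fired at samples:", alerts_for_samples(used_pct, 90, 1))
```

Requiring more than one consecutive sample suppresses alerts for short spikes, at the cost of a later notification for a sustained problem.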

From the Alert Log Metric group, set Enterprise Manager to monitor the alert log for errors as shown in Table 8-2.

Table 8-2 Recommendation for Monitoring the Alert Log  
Metric Recommendation

Alert

Set this metric to send an alert when an ORA-006XX, ORA-01578 (database corruption), or ORA-00060 (deadlock detected) error occurs. If any other error is recorded, then a warning message is generated.

Data Block Corruption

Set this metric to monitor the alert log for ORA-01157 and ORA-27048 errors. They signal a corruption in an Oracle Database datafile.

Data Guard Log Transport

Set this metric.
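The Alert row of Table 8-2 describes a two-level policy: a short list of error codes raises an alert, and any other ORA error raises a warning. A rough Python sketch of that policy applied to individual alert log lines follows. The regular expression, the function name, and the sample log lines are assumptions for illustration, and a real alert log has more structure than this sketch handles.

```python
import re
from typing import Optional, Tuple

# Matches codes written with or without leading zeros, such as ORA-600 or ORA-00600.
ORA_CODE = re.compile(r"\bORA-(\d{1,5})\b")

# Codes singled out in the Alert row: corruption and deadlock detected.
ALERT_CODES = {1578, 60}


def classify_alert_line(line: str) -> Optional[Tuple[str, str]]:
    """Return (severity, normalized code) for the first ORA error on a line."""
    match = ORA_CODE.search(line)
    if match is None:
        return None  # not an error line
    code = int(match.group(1))
    label = f"ORA-{code:05d}"
    if 600 <= code <= 699 or code in ALERT_CODES:
        return ("critical", label)
    return ("warning", label)


if __name__ == "__main__":
    for text in (
        "ORA-00600: internal error code, arguments: [example]",
        "ORA-00060: Deadlock detected.",
        "ORA-01555: snapshot too old",
        "Thread 1 advanced to log sequence 42",
    ):
        print(classify_alert_line(text), "<-", text)
```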

Monitor the system to ensure that the processing capacity is not exceeded. The warning and critical levels for these events should be modified based on the usage pattern of the system. Set the events from the Database Limits metric group. Table 8-3 contains the recommendations.

Table 8-3 Recommendations for Monitoring Processing Capacity
Metric Recommendation

Process limit

Set thresholds for this metric to warn if the number of current processes approaches the value of the PROCESSES initialization parameter.

Session limit

Set thresholds for this metric to warn if the instance is approaching the maximum number of concurrent connections allowed by the database.

Figure 8-3 shows the Notification Rule property page for choosing metrics. The user has chosen Critical and Warning as the severity states for notification. The list of Available Metrics is shown in the left list box. The metrics that have been selected for notification are shown in the right list box.

Figure 8-3 Setting Notification Rules for Metrics

See Also:

Oracle Database 2 Day DBA for information about setting up notification rules and metric thresholds

Use Database Target Views to Monitor Health, Availability, and Performance

The Database Targets page in Figure 8-4 shows an overview of system performance, space utilization, and the configuration of important availability components like archived redo log status, flashback log status, and estimated instance recovery time. Alerts are displayed immediately. Each of the alert values can be configured from links on this page.

Figure 8-4 Overview of System Performance

Many of the metrics from Enterprise Manager pertain to performance. A system without adequate performance is not an HA system, regardless of the status of any of the individual components. While performance problems seldom cause a major system outage, they can still cause an outage to a subset of customers. Outages of this type are commonly referred to as application service brownouts. The primary cause of brownouts is the intermittent or partial failure of one or more infrastructure components. IT managers must be aware of how the infrastructure components are performing (their response time, latency, and availability) and how they are affecting the quality of application service delivered to the end user.

A performance baseline, derived from normal operations that meet the SLA, should determine what constitutes a performance metric alert. Baseline data should be collected from the first day that an application is in production and should include the following:

You can use Enterprise Manager to capture a snapshot of database performance as a baseline. Enterprise Manager compares these values against system performance and displays the result on the database Target page. It can also send alerts if the values deviate too far from the established baseline.
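The idea of alerting when values deviate too far from a baseline can be sketched as a relative-change check. The following Python fragment is an illustration under stated assumptions: the metric names, the sample numbers, and the single `tolerance` parameter are hypothetical and do not reflect how Enterprise Manager stores or compares baselines.

```python
from typing import Dict


def deviations(baseline: Dict[str, float], current: Dict[str, float], tolerance: float) -> Dict[str, float]:
    """Return metrics whose relative change from the baseline exceeds `tolerance`.

    The result maps each flagged metric name to its relative change, so 0.6
    means 60 percent above the baseline and -0.5 means 50 percent below it.
    """
    flagged = {}
    for name, base in baseline.items():
        if name not in current or base == 0:
            continue  # cannot compute a relative change for this metric
        change = (current[name] - base) / base
        if abs(change) > tolerance:
            flagged[name] = round(change, 3)
    return flagged


if __name__ == "__main__":
    # Hypothetical baseline captured during normal operation.
    baseline = {"disk_io_per_sec": 400.0, "cpu_busy_pct": 50.0, "parses_per_sec": 120.0}
    current = {"disk_io_per_sec": 420.0, "cpu_busy_pct": 80.0, "parses_per_sec": 60.0}
    print(deviations(baseline, current, tolerance=0.25))
```

A drop below the baseline is flagged as well as a rise, because a sudden fall in activity such as parses per second can also indicate a change in application behavior.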

Set the database notification rule to capture the metrics listed in Table 8-4 for all database targets. Analysis of these parameters can then be done using one tool and historical data will be available.

Table 8-4 Recommended Notification Rules for Metrics
Metric Recommendation

Disk I/O per Second

This is a database-level metric that monitors I/O operations done by the database. It sends an alert when the number of operations exceeds a user-defined threshold. Use this metric with operating system-level events that are also available with Enterprise Manager.

Set this metric based on the total I/O throughput available to the system, the number of I/O channels available, network bandwidth (in a SAN environment), the effects of the disk cache if you are using a storage array device, and the maximum I/O rate and number of spindles available to the database.

% CPU Busy

Set this metric to warn at 75 percent and to show a critical alert between 85 percent and 90 percent. This usage may be normal at peak periods, but it may also be an indication of a runaway process or of a potential resource shortage.

% Wait Time

Excessive idle time indicates that a bottleneck for one or more resources is occurring. Set this metric based on the system wait time when the application is performing as expected.

Network Bytes per Second

This metric reports network traffic that Oracle generates. It can indicate a potential network bottleneck. Set this metric based on actual usage during peak periods.

Total Parses per Second

This metric measures SQL performance. It can indicate an application change or change in usage that has created a shortage of resources. Set it based on peak periods.

See Also:

Oracle Database Performance Tuning Guide for more information about performance monitoring

Use Event Notifications to React to Metric Changes

There are many operating system events that can be used to supplement a suggested metric. Such operating system events are not required for each host and instance. All metrics defined here can be set individually by instance or database using the Manage Metrics link at the bottom of the navigation bar of the object target page. The values that trigger a warning or critical alert can be changed here, and an operating system script can be activated to respond to a metric threshold, in addition to the standard alert being generated to the Oracle Enterprise Manager 10g Grid Control.

Use Events to Monitor Data Guard System Availability

Set Enterprise Manager metrics to monitor the availability of logical and physical Data Guard configurations. If a Data Guard environment is registered with the Data Guard Manager extension of Enterprise Manager, then set the events shown in Table 8-5.

Table 8-5 Recommendations for Setting Data Guard Events
Metric Recommendation

Data Guard Status

Set this metric to be notified of system problems in a Data Guard configuration.

Data Not Applied

Set this metric to be notified when the gap (measured in minutes) between the last archived redo log received and the last log applied on the standby database exceeds a user-defined threshold. This information can be used to warn the administrator if the recovery time for a standby instance will exceed the defined outage recovery service level. Set this metric based on the specifications for log application for the standby database.

Data Not Received

Set this metric to be notified if there is an extended delay in moving archived redo logs from the production database to the standby database. This metric triggers when the difference between the number of archived redo logs on the production database and the number of archived redo logs shipped to the standby site exceeds a user-defined threshold. The threshold should be based on the amount of time it takes to transport an archived redo log across the network.

Set the sample time for the metric to be approximately the log transport time, and set the number of occurrences to be 2 or greater to avoid false positives. Recommended starting values for the warning and critical thresholds are 1 and 2.
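The Data Not Received recommendation combines a gap measured in archived redo logs with warning and critical thresholds of 1 and 2 and an occurrence count of 2. The following Python sketch works through that combination on a series of samples. The function name and the sample counts are hypothetical, and the strict "greater than" comparison is an assumption based on the word "exceeds" in the table.

```python
from typing import List, Sequence, Tuple


def data_not_received_states(
    samples: Sequence[Tuple[int, int]],
    warning: int = 1,
    critical: int = 2,
    occurrences: int = 2,
) -> List[str]:
    """Classify each sample of (logs archived on primary, logs shipped to standby).

    A state is raised only after the gap has exceeded the corresponding
    threshold for `occurrences` consecutive samples.
    """
    states = []
    warn_streak = 0
    crit_streak = 0
    for archived, shipped in samples:
        gap = archived - shipped
        warn_streak = warn_streak + 1 if gap > warning else 0
        crit_streak = crit_streak + 1 if gap > critical else 0
        if crit_streak >= occurrences:
            states.append("critical")
        elif warn_streak >= occurrences:
            states.append("warning")
        else:
            states.append("clear")
    return states


if __name__ == "__main__":
    # Hypothetical counts sampled at roughly the log transport interval.
    history = [(100, 100), (102, 100), (103, 101), (105, 102), (106, 103), (106, 106)]
    print(data_not_received_states(history))
```

With an occurrence count of 2, a single delayed log does not raise a warning, which is the false-positive case the recommendation is intended to avoid.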

Managing the HA Environment with Enterprise Manager

Use Enterprise Manager as a proactive part of administering any system as well as for problem notification and analysis. This section includes the following recommendations:

Check Enterprise Manager Policy Violations

Enterprise Manager comes with a pre-installed set of policies and recommendations of best practices for all databases. These policies are checked by default, and the number of violations is displayed on the Targets page in Figure 8-4. Select Policy Violations from the Targets page to see a list of all violations.

Use Enterprise Manager to Manage Oracle Patches and Maintain System Baselines

You can use Enterprise Manager to download and manage patches from http://metalink.oracle.com for any monitored system in the application environment. A job can be set up to routinely check for patches that are relevant to the user environment. Those patches can be downloaded and stored directly in the Management Repository. Patches can be staged from the Management Repository to multiple systems and applied during maintenance windows.

You can examine patch levels for one machine and compare them between machines in either a one-to-one or one-to-many relationship. In this case, a machine can be identified as a baseline and used to demonstrate maintenance requirements in other machines. This can be done for operating system patches as well as database patches.
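The one-to-many comparison against a baseline machine amounts to a set difference per host. The short Python sketch below illustrates the idea. The host names and patch identifiers are invented for the example, and the function is not an Enterprise Manager interface.

```python
from typing import Dict, Set


def missing_patches(baseline: Set[str], hosts: Dict[str, Set[str]]) -> Dict[str, Set[str]]:
    """Return, for each host that lags the baseline, the patches it lacks."""
    report = {}
    for host, applied in hosts.items():
        gap = baseline - applied  # patches on the baseline but not on this host
        if gap:
            report[host] = gap
    return report


if __name__ == "__main__":
    # Hypothetical patch identifiers on a baseline machine and two other hosts.
    baseline_patches = {"P1001", "P1002", "P1003"}
    inventory = {
        "hostA": {"P1001", "P1002", "P1003"},
        "hostB": {"P1001"},
    }
    for host, gap in missing_patches(baseline_patches, inventory).items():
        print(host, "is missing", sorted(gap))
```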

Use Enterprise Manager to Manage Data Guard Targets

Enterprise Manager can be used to set up logical and physical standby databases for any database target. It also provides the ability to manage switchover and failover of database targets other than the database that contains the Management Repository.

Enterprise Manager can also be used to monitor the health of a Data Guard configuration at a glance. From any database target page, navigate to the Data Guard status section by using the link in the High Availability section. The page shows the active standby databases for the primary target, the amount of log data waiting for shipment and receipt by the standby database, and the data protection mode. You can also modify the data protection mode from this page.

This page contains a link to the Verify function, which checks the Data Guard environment and log transport services and displays warnings and errors. The Verify function must be run manually; it is not automatic.

Highly Available Architectures for Enterprise Manager

The Enterprise Manager architecture consists of a three-tier framework as shown in Figure 8-5.

Figure 8-5 Enterprise Manager Architecture

The components of the architecture are as follows:

  • Web-based Grid Control: The Enterprise Manager user interface for centrally managing the entire computing environment from one location. All of the services within the enterprise, including hosts, databases, listeners, application servers, HTTP Servers, and Web applications, are easily managed as one cohesive unit.
  • Oracle Management Service and Oracle Management Repository: The Management Service is a J2EE Web application that renders the user interface for the Grid Control, works with all Management Agents in processing monitoring and job information, and uses the Management Repository as its data store. The Management Repository consists of tablespaces in an Oracle database that contain information about administrators, targets, and applications that are managed within Enterprise Manager.
  • Oracle Management Agents: Management Agents are processes that are deployed on each monitored host. The Management Agent is responsible for monitoring all targets on the host, for communicating that information to the Management Service, and for managing and maintaining the host and the products installed on the host. The managed targets in the figure include the database, the third-party application, and the application server.
  • The Database Control enables you to monitor and administer a single Oracle Database instance o r a clustered database.
  • The Application Server Control enables you to monitor and administer a single Oracle Application Server instance, a farm of Oracle Application Server instances, or Oracle Application Serv er Clusters.

Enterprise Manager provides a detailed set of tools to monitor itself.

The Management System page is a predefined component of Enterprise Manager that shows the administrator an overview of the Enterprise Manager components, backlogs in processing agent data, and component availability.

The Management System page shows essential metrics, including the amount of space left in the repository and the amount of data waiting to be loaded to the repository. This page also provides a view of alerts or warnings against the management system. The Repository Operations page provides an overview of the individual component tasks that make up the management system. The Repository Operations page shows the individual components at a glance, including the amount of CPU resource consumed and processing errors. A default notification rule is created when the product is installed and should be configured to notify the system administrator of a problem with any Enterprise Manager component.

Set the following options to monitor an Enterprise Manager environment:

  • Modify the Repository Operations Notification rule to provide updates on Management Service Status, Targets Not Providing Data, and Total Loader Run Time. Access this rule from the Notifications Rules page. See "Set Up Default Notification Rules for Each System".
  • Update the emd.properties file with a valid e-mail address and mail server for any agent that monitors a Management Service or Management Repository node. This provides Enterprise Manager an additional method of notification if the repository fails. Instructions for setting emd.properties are in the Tip section of the Grid Control home page.

The rest of this section includes the following topics:

Recommendations for an HA Architecture for Enterprise Manager

The following recommendations are described in this section:

Protect the Repository and Processes As Well as the Configuration They Monitor

Availability requirements need to be addressed for each layer of the Enterprise Manager stack. The minimum recommendation for the Enterprise Manager repository and processes is to host them in a configuration that has the same protections as the system with the highest level of availability monitored by Enterprise Manager. The Enterprise Manager architecture must be as reliable as the application architecture. It is crucial for the monitoring framework to detect problems and manage repair as efficiently as possible. The Enterprise Manager implementation should be designed to be as available as the most available application it monitors because the Enterprise Manager framework is used to generate alerts if any monitored application fails.

Place the Management Repository in a RAC Instance and Use Data Guard

The Management Repository is the foundation of all Enterprise Manager operations. If the Enterprise Manager system is being used to monitor and alert on a system using a RAC and Data Guard configuration, but the Management Repository is hosted only on a single instance, then an outage of the Enterprise Manager system puts the administrator at risk of not being notified in a timely fashion of problems in production systems. Consider placing the Management Repository in a RAC instance to protect from individual instance failure and using Data Guard to protect from site failure.

Configure At Least Two Management Service Processes and Load Balance Them

For the middle tier, the baseline recommendation is to have a minimum of two Management Service processes, using a hardware server load balancer to mask the location of an individual Management Service process and a failure of any individual component. This provides immediate coverage for a single failure in the most critical components in the Enterprise Manager architecture with little interruption of service for all systems monitored using Enterprise Manager. Hardware server load balancers can also be monitored and configured using Enterprise Manager, providing coverage across the operating system stack. Management Service processes connect to the repository instances using Oracle Net.
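To make the role of the load balancer concrete, the following Python sketch rotates requests across Management Service processes and skips any process that a health check reports as down. It is a conceptual illustration only. The service names and the `is_up` callback are hypothetical, and a hardware server load balancer implements this behavior in its own way.

```python
from itertools import cycle
from typing import Callable, List


def route_requests(services: List[str], is_up: Callable[[str], bool], count: int) -> List[str]:
    """Assign `count` requests round-robin, skipping services that are down."""
    rotation = cycle(services)
    routed = []
    for _ in range(count):
        # Try each service at most once per request before giving up.
        for _ in range(len(services)):
            candidate = next(rotation)
            if is_up(candidate):
                routed.append(candidate)
                break
        else:
            raise RuntimeError("no Management Service process is available")
    return routed


if __name__ == "__main__":
    pool = ["oms1", "oms2"]  # hypothetical Management Service processes
    print(route_requests(pool, lambda name: True, 4))
    print(route_requests(pool, lambda name: name != "oms1", 3))
```

With two processes behind the balancer, the loss of one leaves all requests routed to the survivor, which is the single-failure coverage described above.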

Consider Hosting Enterprise Manager on the Same Hardware as an HA System

To reduce hardware overload and use current resources, the repository and Management Service processes can be hosted on the same hardware as another highly available production system. This assumes that the secondary site has the capacity and bandwidth to handle the production load plus an active Enterprise Manager repository and Management Service process. A hardware server load balancer should be used as a front end for multiple Management Service processes to manage failure of an individual Management Service and to balance the workload across the middle tier.

Agents from any monitored node in the environment can connect to any active Management Service process. Load balancing of the agent processes connecting to the Management Service processes is handled internally by Enterprise Manager.

Monitor the Network Bandwidth Between Processes and Agents

Sufficient network bandwidth must be available to support the communication between the Management Service processes and the Management Agents. If the repository is used to manage a larger enterprise, then communication between agents and Management Service processes can be significant, depending on the number of scheduled events and jobs. If the Enterprise Manager framework is used to monitor multiple applications and more dedicated system resources are required, then consider scaling the Management Repository and Management Service processes with additional nodes. The Management Repository and Management Service processes can be scaled independently. If required, additional hardware outside of the cluster can be added to scale the number of Management Service processes.

Unscheduled Outages for Enterprise Manager

Enterprise Manager is the primary control interface for managing your data center. An outage of Enterprise Manager causes a critical lack of visibility into the performance and availability metrics that allow the DBA to manage overall system performance. Table 8-6 describes the outages that can occur to any of the tiers involved in Enterprise Manager and how to recover from each outage.

Table 8-6 Unscheduled Outages for Enterprise Manager
Type of Outage Possible Reasons for Outage Solutions or Alternatives

Management Repository instance failure

Hardware failure, Oracle database failure, network failure to a single node of a RAC instance on the primary site, listener failure

This is best managed by using a RAC environment for the Management Repository.

In a RAC environment, connections reconnect to the second node using Oracle Net failover. When the failed node is restored, the load is rebalanced automatically.

Primary site failure

Network outage to both nodes, cluster failure, interconnect failure, hardware failure

Requires Data Guard failover to secondary site:

  • Stop the listeners configured for Enterprise Manager traffic on the primary site nodes
  • Perform a Data Guard failover for the Management Repository
  • Start the Enterprise Manager listeners and Management Service processes on the new production site

Note: This cannot be managed by Enterprise Manager directly. See "Data Guard Failover Using SQL*Plus".

Management Agent failure

Process failure, accidental user termination

The Management Agent watchdog process restarts the Management Agent. The number of restarts is bounded by user-configurable parameters to avoid unnecessary processing on the monitored node.

Watchdog failure

Process failure, accidental user termination

No data is reported. Logging stops for the Management Agent. Any hanging processes must be manually stopped, and the Management Agent must be restarted.

Note: The watchdog failure is not reported back to the Enterprise Manager GUI.

System state data is deleted or corrupted

Agent failure, user deletion of state files

Stop Management Agent processes (EMAGENT and EMWD) that are still running.

Restart the agent (emctl start agent).

Management Service process failure

Process failure, accidental user termination

The Oracle Process Manager and Notification Server (OPMN) restarts the Management Service.

A server load balancer (SLB) can be used for multiple Management Service processes. This masks process failures and distributes the workload across the middle tier.

Failure of a Management Service causes the GUI session connected to it to fail. The GUI session must be restarted on a surviving Management Service.

Oracle Process Manager and Notification Server (OPMN) failure

Process failure, accidental user termination

No data is reported; logging stops for the Management Service process. Hanging processes need to be manually killed (on UNIX platforms) and the agent needs to be restarted.

Note: Death of the watchdog is not reported back to the Enterprise Manager GUI.

Grid Control disconnect

Grid Control loses connection to Management Service because of a network problem, Management Service failure, or node failure

Because the Grid Control is stateless, it receives data from Management Service processes. The failure is resolved by connecting to a surviving Management Service or by starting the Grid Control itself.
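The Management Agent failure row of Table 8-6 notes that the watchdog restarts the agent and that the number of restarts is bounded. The following Python sketch shows the general shape of a bounded-restart supervisor. It is not the watchdog's actual implementation: the `start` callback, the `max_restarts` parameter name, and the error raised when the limit is reached are assumptions made for illustration.

```python
from typing import Callable


def supervise(start: Callable[[], bool], max_restarts: int) -> int:
    """Run a process, restarting it on failure up to `max_restarts` times.

    `start` runs the process once and returns True on a clean exit or False
    on failure. The return value is the number of restarts performed. A
    RuntimeError is raised when the restart budget is exhausted, so that a
    repeatedly failing process does not consume resources indefinitely.
    """
    restarts = 0
    while not start():
        if restarts == max_restarts:
            raise RuntimeError("restart limit reached; manual intervention required")
        restarts += 1
    return restarts


if __name__ == "__main__":
    # Simulated process that fails twice and then runs cleanly.
    outcomes = iter([False, False, True])
    print("restarts needed:", supervise(lambda: next(outcomes), max_restarts=3))
```

Bounding the restart count reflects the trade-off stated in the table: automatic recovery for transient failures without unnecessary processing on the monitored node when the failure persists.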

Additional Enterprise Manager Configuration

This section contains additional configuration information that will be helpful in building Enterprise Manager in an MAA environment. It includes the following topics:

Configure a Separate Listener for Enterprise Manager

Traffic from the Management Agents is routed to the Management Service processes and then to the Management Repository by Oracle Net. To isolate this traffic from other application traffic and to support Data Guard if required for site failover, configure the Enterprise Manager traffic through a separate listener. The listener is active only on the node where the active Enterprise Manager instance is running. Do not set the GLOBAL_DBNAME parameter in the listener.ora file because setting it disables Transparent Application Failover (TAF) and connect-time failover. Configure the LOCAL_LISTENER and REMOTE_LISTENER initialization parameters to enable dynamic service registration and cross-registration. The following is an example of a listener configuration:

LISTENER_N1=
        (description_list=
             (description=
               (address_list=
                (address=(protocol=tcp)(port=1521)(host=EMPRIM1.us.oracle.com))
               )
             )
        )
SID_LIST_LISTENER_N1 =
  (SID_LIST =
    (SID_DESC =
      (ORACLE_HOME = /mnt/app/oracle/product/10g)
      (SID_NAME = EM1)
    )
  )

LISTENER_DG=
        (description_list=
             (description=
               (address_list=
                (address=(protocol=tcp)(port=1529)(host=EMPRIM1.us.oracle.com))
               )
             )
        )
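Because setting GLOBAL_DBNAME in listener.ora disables TAF and connect-time failover, it can be useful to check a configuration file for that parameter before deploying it. The following Python sketch is one hypothetical way to do so. The function name and the sample text are assumptions, and the check is a plain text search rather than a full parser for the listener.ora syntax.

```python
import re
from typing import List


def global_dbname_lines(listener_ora: str) -> List[int]:
    """Return the 1-based line numbers on which GLOBAL_DBNAME is set."""
    hits = []
    for number, line in enumerate(listener_ora.splitlines(), start=1):
        active = line.split("#", 1)[0]  # ignore text after a comment marker
        if re.search(r"\bGLOBAL_DBNAME\s*=", active, flags=re.IGNORECASE):
            hits.append(number)
    return hits


if __name__ == "__main__":
    # Hypothetical fragment that violates the recommendation on line 4.
    sample = """SID_LIST_LISTENER_N1 =
  (SID_LIST =
    (SID_DESC =
      (GLOBAL_DBNAME = em.example.com)
      (SID_NAME = EM1)
    )
  )
"""
    print("GLOBAL_DBNAME found on lines:", global_dbname_lines(sample))
```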

Install the Management Repository Into an Existing Database

To avoid installation problems when building a RAC-based repository, it is easier to install Enterprise Manager into an existing database and build any tablespaces in advance. Certain versions of the Oracle Universal Installer do not handle installing the repository into a RAC database. Build the database first; then use the Install option to install into an existing database.


Oracle
Copyright © 2003 Oracle Corporation
All Rights Reserved.