H165892:  ORACLE TERMINATES WITH FAILURE OF NODE IN A TWO-NODE CLUSTER


 TEXT:

 Oracle database instance terminates after failure of one node
 in a two-node cluster.

 SYMPTOMS:

 1. When a two-node Oracle Parallel Server (OPS) database is
 running and one of the nodes fails, the remaining node also
 fails after a period ranging from 30 minutes to several
 hours.

 2. Clients connected to the remaining node begin to time out
 and stop.

 3. One or more of the following Oracle messages are seen:
      a) ORA-00472:  PMON process terminated with error

      b) ORA-27103:  internal error

      c) ORAnnnnn.TRC file contains the message "FATAL ERROR IN
         TWO TASK SERVER error 12571"

 4. The file %oraclehome%\database\io.log is very large (2 MB or
 greater) and contains the following lines, repeated from the
 time of the first node failure until the time the database
 terminated on the second node:

      IOInService()...
      IOInService(OK)...
      IOOutOfService()...
      IOOutOfService(OK)...
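
 The check described in symptom 4 can be sketched in Python as
 shown here. This is a minimal sketch, not an IBM or Oracle tool:
 the function names are hypothetical, the 2 MB threshold is taken
 from the symptom text as a rough guide, and the script assumes
 that each log line begins with one of the four names listed
 above (the trailing "..." is not matched).

```python
import os
from collections import Counter

# Line prefixes quoted in symptom 4. Assumption: each log line starts with
# one of these names; the trailing "..." shown in the tip is not matched.
MARKERS = (
    "IOInService()",
    "IOInService(OK)",
    "IOOutOfService()",
    "IOOutOfService(OK)",
)

# 2 MB, from the symptom text; treat it as a rough guide, not a hard limit.
SIZE_THRESHOLD_BYTES = 2 * 1024 * 1024


def summarize_io_log(path):
    """Return (size_in_bytes, Counter of marker prefixes) for an io.log file."""
    size = os.path.getsize(path)
    counts = Counter()
    with open(path, "r", errors="replace") as log:
        for line in log:
            text = line.strip()
            for marker in MARKERS:
                if text.startswith(marker):
                    counts[marker] += 1
                    break
    return size, counts


def matches_symptom(size, counts):
    """True when the file is large and every marker line appears repeatedly."""
    return size >= SIZE_THRESHOLD_BYTES and all(counts[m] > 1 for m in MARKERS)
```

 To use it, pass the full path of io.log under the Oracle home
 directory to summarize_io_log() and give the two returned values
 to matches_symptom(). A True result only indicates that the log
 resembles symptom 4; it does not confirm the problem.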

 PROBLEM ISOLATION AIDS:

 - The system is a Netfinity 7000 M-10 server, Type 8680,
   any supported Model configured with Oracle Parallel Server.

 Note: Supported configurations are listed at the following URL:

      http://www.pc.ibm.com/us/compat/clustering/matrix.shtml
      look for "Oracle Parallel Server"

 - The system is configured with the following option:

      IBM Netfinity Cluster Enabler Software

 - NOS affected:

      Windows NT 4.0 Enterprise Edition with
      Service Pack 4 applied.

 FIX:

 This problem has been reported to Oracle, and Oracle Bug number
 812552 has been opened.

 The fix will be determined by Oracle and is expected to be
 provided as a new patchset for Oracle V8.0.5.

 WORKAROUND:

 The problem is related to the load placed on the nodes in the
 cluster and whether the client programs have a time-out period
 of less than about 30 minutes.  To avoid the problem, apply one
 or both of the following recommendations:

      1. Limit the load on the two-node cluster so that if the
      entire load were placed onto a single node, the database
      could still handle the load with some capacity to spare.

      2. Increase the time-out period for client programs to at
      least 30 minutes.
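
 The two recommendations can be expressed as a simple planning
 check, sketched here in Python. The function names, the load
 units (for example, transactions per second), and the 10 percent
 spare-capacity margin are assumptions made for illustration; the
 tip itself says only "some capacity to spare". The 30-minute
 minimum comes from recommendation 2.

```python
def single_node_headroom(node_loads, node_capacity):
    """Fraction of one node's capacity left over if it carried the whole load.

    node_loads    -- current load on each node, in any consistent unit
    node_capacity -- load a single node can sustain, in the same unit
    """
    total = sum(node_loads)
    return (node_capacity - total) / node_capacity


def check_workarounds(node_loads, node_capacity, client_timeout_minutes,
                      min_headroom=0.10, min_timeout_minutes=30):
    """Report which of the two recommendations the configuration satisfies.

    min_headroom is an assumed margin (10 percent); min_timeout_minutes
    is the 30-minute figure from recommendation 2.
    """
    headroom = single_node_headroom(node_loads, node_capacity)
    return {
        "load_fits_one_node": headroom >= min_headroom,
        "timeout_long_enough": client_timeout_minutes >= min_timeout_minutes,
    }


if __name__ == "__main__":
    # Hypothetical figures: two nodes at 300 and 250 units, each node rated
    # for 700 units, and clients that give up after 10 minutes.
    print(check_workarounds([300, 250], 700, client_timeout_minutes=10))
```

 With the hypothetical figures in the example, the combined load
 fits on one node with margin, but the 10-minute client time-out
 is shorter than the recommended 30 minutes.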

 DETAILS:

 The termination of the database on the remaining node is caused
 by the extra heavy load placed on the remaining node after the
 first node terminates. Rather than degrading performance
 gracefully, the Oracle database may periodically pause for up
 to 20 minutes. If clients have time-out periods shorter than
 this pause time, they will time out. If those client programs
 then terminate, the Oracle database must free up the resources
 (threads) allocated by the Oracle server for those clients. If
 many clients terminate all at once, the Oracle database does
 not free up these resources correctly, and the remaining
 database instance terminates.

 The workarounds address this problem by avoiding the pause
 condition and/or avoiding the termination of many clients at
 once. This problem has not been seen on clusters with a greater
 number of nodes since the workload of a single failed node
 tends to be distributed among the multiple remaining nodes. In
 this case, the database does not pause and clients do not
 time out.
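
 The effect of cluster size can be illustrated with a short
 calculation, sketched here in Python. It assumes that every node
 carries an equal load before the failure and that the failed
 node's load is spread evenly over the survivors; real workloads
 may not redistribute this evenly. The function name is
 hypothetical.

```python
from fractions import Fraction


def survivor_load_factor(node_count):
    """Factor by which each surviving node's load grows after one node fails.

    Assumes equal load per node before the failure and even redistribution
    afterward, so each survivor carries node_count / (node_count - 1) times
    its original load.
    """
    if node_count < 2:
        raise ValueError("a cluster needs at least two nodes")
    return Fraction(node_count, node_count - 1)


if __name__ == "__main__":
    for nodes in (2, 3, 4, 8):
        factor = survivor_load_factor(nodes)
        print(f"{nodes} nodes: each survivor carries {float(factor):.2f}x its load")
```

 Under these assumptions, the surviving node in a two-node
 cluster carries twice its original load, while each survivor in
 a four-node cluster carries about one third more.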

 TRADEMARKS:

 Microsoft, Windows, Windows NT, and the Windows logo are
 trademarks of Microsoft Corporation in the United States and/or
 other countries.

 Other company, product and service names may be the trademarks
 or service marks of others.


DATE: February 17, 1999