H165892: ORACLE TERMINATES WITH FAILURE OF NODE IN A TWO-NODE CLUSTER

TEXT:
Oracle database instance terminates after failure of one node in a two-node cluster.

SYMPTOMS:
1. When running a two-node Oracle Parallel Server (OPS) database and one of the nodes fails, the remaining node also fails after a period of time ranging from 30 minutes to several hours.
2. Clients connected to the remaining node begin to time out and stop.
3. One or more of the following Oracle messages is seen:
   a) ORA-00472: PMON process terminated with error
   b) ORA-27103: internal error
   c) ORAnnnnn.TRC file contains the message "FATAL ERROR IN TWO TASK SERVER error 12571"
4. The file %oraclehome%\database\io.log is very large (2MB or greater) and contains the following lines, repeated from the time of the first node failure until the time the database terminated on the second node:
   IOInService()...
   IOInService(OK)...
   IOOutOfService()...
   IOOutOfService(OK)...

PROBLEM ISOLATION AIDS:
- The system is a Netfinity 7000 M-10 server, Type 8680, any supported Model configured with Oracle Parallel Server.
  Note: Supported configurations are listed at the following URL (look for "Oracle Parallel Server"):
  http://www.pc.ibm.com/us/compat/clustering/matrix.shtml
- The system is configured with the following option: IBM Netfinity Cluster Enabler Software
- NOS affected: Windows NT 4.0 Enterprise Edition with Service Pack 4 applied.

FIX:
This problem has been reported to Oracle, and Oracle Bug number 812552 has been opened. The fix will be determined by Oracle and is expected to be provided as a new patchset for Oracle V8.0.5.

WORKAROUND:
The problem is related to the load placed on the nodes in the cluster and to whether the client programs have a time-out period of less than about 30 minutes. To avoid the problem, apply one or both of the following recommendations:
1. Limit the load on the two-node cluster so that, if the entire load were placed onto a single node, the database could still handle it with some capacity to spare.
2. Increase the time-out period for client programs to at least 30 minutes.

DETAILS:
The termination of the database on the remaining node is caused by the extra heavy load placed on that node after the first node terminates. Rather than degrading performance gracefully, the Oracle database may periodically pause for up to 20 minutes. If clients have time-out periods shorter than this pause time, they will time out. If those client programs then terminate, the Oracle database must free the resources (threads) allocated by the Oracle server for those clients. If many clients terminate all at once, the Oracle database does not free these resources correctly, and the remaining database instance terminates.

The workarounds address this problem by avoiding the pause condition, by avoiding the termination of many clients at once, or both.

This problem has not been seen on clusters with a greater number of nodes, since the workload of a single failed node tends to be distributed among the multiple remaining nodes. In this case, the database does not pause and clients do not time out.

TRADEMARKS:
Microsoft, Windows, Windows NT, and the Windows logo are trademarks of Microsoft Corporation in the United States and/or other countries. Other company, product and service names may be the trademarks or service marks of others.

DATE: February 17, 1999
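
EXAMPLE:
Symptom 4 above (a large io.log filled with repeated IOInService/IOOutOfService entries) can be checked with a short script. The following is a minimal sketch in Python, not a tool supplied by IBM or Oracle. The function name check_io_log, the constant names, and the choice to match on the prefixes "IOInService(" and "IOOutOfService(" are assumptions made for illustration; only the 2MB threshold and the entry names come from the symptom description.

```python
import os

# Threshold taken from symptom 4 ("2MB or greater").
SIZE_THRESHOLD_BYTES = 2 * 1024 * 1024

# Assumed prefixes of the repeated io.log entries listed in symptom 4.
# "IOInService(" matches both "IOInService()..." and "IOInService(OK)...".
MARKERS = ("IOInService(", "IOOutOfService(")


def check_io_log(path):
    """Return (size_in_bytes, marker_counts, matches_symptom) for an io.log file.

    matches_symptom is True only when the file is at least 2MB and
    every marker prefix occurs at least once.
    """
    size = os.path.getsize(path)
    counts = {marker: 0 for marker in MARKERS}
    # errors="replace" keeps the scan going if the log has odd bytes.
    with open(path, "r", errors="replace") as log:
        for line in log:
            for marker in MARKERS:
                counts[marker] += line.count(marker)
    matches = size >= SIZE_THRESHOLD_BYTES and all(n > 0 for n in counts.values())
    return size, counts, matches
```

To use it, call check_io_log with the full path to io.log under the Oracle home directory on the surviving node. A True result only indicates that the file matches symptom 4; it does not by itself confirm the problem described in this document.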