Avaloq script terminating Oracle Processes
By: Date: March 12, 2016 Categories: Avaloq,Database Crash,modelcache,Oracle

One of our Avaloq Databases crashed with the following information in the alert log:

Wed Mar 02 19:10:22 2016
PMON (ospid: 26747): terminating the instance due to error 471
Wed Mar 02 19:10:23 2016
System state dump requested by (instance=1, osid=26747 (PMON)), summary=[abnormal instance termination].
System State dumped to trace file /u00/oracle/orabase/diag/rdbms/ora5prim/ORA5/trace/ORA5_diag_26774_20160302191023.trc
Instance terminated by PMON, pid = 26747

Contents of the Pmon trace file:

Background process DBW0 found dead
Oracle pid = 10
OS pid (from detached process) = 26780
OS pid (from process state) = 26780
dtp = 6001ba70, proc = 3a46b1428

Assuming DBW0 died there should be some information in the dbw trace file. Unfortunately there was nothing about the crash in the trace file. So DBW0 crashed without leaving any trace information. How is this possible? After searching the Oracle Metalink for any bugs which might be related to the case I remembered that we had problems in the past where Avaloq accidentally killed a critical Oracle process.

So I checked the OS Audit logs:

 exec_args,3,/usr/bin/ksh,/u00/oracle/orabase/admin/ORA9/asmd/bin/modelcache.sh,stop
 subject,900074,1000,1000,1000,1000,21383,1709692623,4728 5632 workstation2.domain.company
 return,success,0
 zone,orahost02t
 header,229,2,execve(2),,svglb56t,2016-03-02 19:10:22.526 +01:00
 path,/zones/orahost02t/root/usr/bin/cat
 attribute,100555,root,bin,65616,201861,18446744073709551615
 exec_args,2,cat,/u00/oracle/orabase/admin/ORA9/asmd/var/modelcache_0.pid
 subject,900074,1000,1000,1000,1000,21386,1709692623,4728 5632 workstation2.domain.company

So modelcache.sh was executed with the “stop” parameter and it was reading the following file:/u00/oracle/orabase/admin/ORA9/asmd/var/modelcache_0.pid

So what does this script do when using the stop parameter:

eval PID=`cat $PIDFILE`
 if [ "$?" = 0 ]; then
 ps -p $PID > /dev/null 2>&1
 if [ "$?" = 0 ]; then
 #Slave is running. Terminate.
 kill -9 $PID
 rm $PIDFILE

After reading the PID from the Pidfile it terminates the process with “kill -9”.  The script does not verify whether the slave process belongs to it or not. Now what happens if the slave is dead already and another process got this PID? The script will terminate that process.

Although the Pidfile was already deleted I was able to restore it:

$cat /u00/oracle/orabase/admin/ORA9/asmd/var/modelcache_0.pid
 26780

So the script killed the process with PID 26780 which was the DBW0 process at that time. That explains why there was no information about the crash in the dbw0 trace file.

 

BugFix from Avaloq:

Avaloq modified the modelcache script to include a function which verifies the process creation date by issuing the “ls -ld /proc/$PID” command. With this fix it is not possible anymore that the script kills the wrong process.

Leave a Reply

Your email address will not be published. Required fields are marked *