One of our Avaloq Databases crashed with the following information in the alert log:
Wed Mar 02 19:10:22 2016 PMON (ospid: 26747): terminating the instance due to error 471 Wed Mar 02 19:10:23 2016 System state dump requested by (instance=1, osid=26747 (PMON)), summary=[abnormal instance termination]. System State dumped to trace file /u00/oracle/orabase/diag/rdbms/ora5prim/ORA5/trace/ORA5_diag_26774_20160302191023.trc Instance terminated by PMON, pid = 26747
Contents of the Pmon trace file:
Background process DBW0 found dead Oracle pid = 10 OS pid (from detached process) = 26780 OS pid (from process state) = 26780 dtp = 6001ba70, proc = 3a46b1428
Assuming DBW0 died there should be some information in the dbw trace file. Unfortunately there was nothing about the crash in the trace file. So DBW0 crashed without leaving any trace information. How is this possible? After searching the Oracle Metalink for any bugs which might be related to the case I remembered that we had problems in the past where Avaloq accidentally killed a critical Oracle process.
So I checked the OS Audit logs:
exec_args,3,/usr/bin/ksh,/u00/oracle/orabase/admin/ORA9/asmd/bin/modelcache.sh,stop subject,900074,1000,1000,1000,1000,21383,1709692623,4728 5632 workstation2.domain.company return,success,0 zone,orahost02t header,229,2,execve(2),,svglb56t,2016-03-02 19:10:22.526 +01:00 path,/zones/orahost02t/root/usr/bin/cat attribute,100555,root,bin,65616,201861,18446744073709551615 exec_args,2,cat,/u00/oracle/orabase/admin/ORA9/asmd/var/modelcache_0.pid subject,900074,1000,1000,1000,1000,21386,1709692623,4728 5632 workstation2.domain.company
So modelcache.sh was executed with the “stop” parameter and it was reading the following file:/u00/oracle/orabase/admin/ORA9/asmd/var/modelcache_0.pid
So what does this script do when using the stop parameter:
eval PID=`cat $PIDFILE`
if [ "$?" = 0 ]; then
ps -p $PID > /dev/null 2>&1
if [ "$?" = 0 ]; then
#Slave is running. Terminate.
kill -9 $PID
rm $PIDFILE
After reading the PID from the Pidfile it terminates the process with “kill -9”. The script does not verify whether the slave process belongs to it or not. Now what happens if the slave is dead already and another process got this PID? The script will terminate that process.
Although the Pidfile was already deleted I was able to restore it:
$cat /u00/oracle/orabase/admin/ORA9/asmd/var/modelcache_0.pid
26780
So the script killed the process with PID 26780 which was the DBW0 process at that time. That explains why there was no information about the crash in the dbw0 trace file.
BugFix from Avaloq:
Avaloq modified the modelcache script to include a function which verifies the process creation date by issuing the “ls -ld /proc/$PID” command. With this fix it is not possible anymore that the script kills the wrong process.