Focus On Oracle,MySQL,SQL Server and PostgreSQL..

database crash with ora-00494

—

由

于 Oracle rdbms, troubleshooting

database crash with ora-00494

群论坛一个网友的问题，原帖地址：

http://databasefans.com/read.php?tid=273

###### alert 如下 ######

Thu Mar 15 22:59:38 2012
Errors in file /app/oracle/admin/cn3fsa/bdump/cn3fsa_arc1_299280.trc:
ORA-00494: enqueue [CF] held for too long (more than 900 seconds) by 'inst 1, osid 524594'
Thu Mar 15 22:59:40 2012
System State dumped to trace file /app/oracle/admin/cn3fsa/bdump/cn3fsa_arc1_299280.trc
Thu Mar 15 22:59:41 2012
Thread 1 cannot allocate new log, sequence 167905
Checkpoint not complete
  Current log# 5 seq# 167904 mem# 0: /data02/oradata/cn3fsa/redo5a.rdo
  Current log# 5 seq# 167904 mem# 1: /data03/oradata/cn3fsa/redo5b.rdo
Thu Mar 15 22:59:47 2012
Thread 1 advanced to log sequence 167905 (LGWR switch)
  Current log# 7 seq# 167905 mem# 0: /data04/oradata/cn3fsa/redo7a.rdo
  Current log# 7 seq# 167905 mem# 1: /data01/oradata/cn3fsa/redo7b.rdo
Thu Mar 15 22:59:48 2012
Killing enqueue blocker (pid=524594) on resource CF-00000000-00000000
 by killing session 550.1
Thu Mar 15 22:59:50 2012
Errors in file /app/oracle/admin/cn3fsa/bdump/cn3fsa_ckpt_524594.trc:
ORA-00028: your session has been killed
Thu Mar 15 22:59:50 2012
CKPT: terminating instance due to error 28
Instance terminated by CKPT, pid = 524594
Fri Mar 16 08:50:50 2012
Starting ORACLE instance (normal)
sskgpgetexecname failed to get name
LICENSE_MAX_SESSION = 0
LICENSE_SESSIONS_WARNING = 0
Picked latch-free SCN scheme 3
Autotune of undo retention is turned off.
IMODE=BR
ILAT =61
LICENSE_MAX_USERS = 0
SYS auditing is disabled
ksdpec: called for event 13740 prior to event group initialization
Starting up ORACLE RDBMS Version: 10.2.0.4.0.
System parameters with non-default values:
  processes                = 500
  sessions                 = 555

从上面错误，我们可以判断是arch1无法获得CF锁，而阻塞者 524594 很可能是后台核心进行，以至于
被kill后导致实例crash了。

这里需要说明一下的是该机制是这样的：

当进行需要获取CF enqueue时，最大等待时间为900s，超过900s时，将强制kill阻塞进程。通过如下隐含
参数控制这个timeout.

SQL> show parameter controlfile_enq

NAME                                 TYPE        VALUE
------------------------------------ ----------- ------------------------------
_controlfile_enqueue_dump            boolean     FALSE
_controlfile_enqueue_holding_time    integer     120
_controlfile_enqueue_timeout         integer     900
_kill_controlfile_enqueue_blocker    boolean     TRUE

下面我们来看看进程 524594 是什么进程？

通过trace我们可以发现，如下：

Unix process pid: 524594, image: oracle@lgcenjfsa (CKPT)

我们可以发现，原来是ckpt进程被干掉了，导致数据库crash了，那么我们再进一步分析，
是什么进程kill ckpt进程了呢？从上面alert log来看，error是写到arch1 trace里面的，
所以这里我想应该是跟lgwr进程有关系。

从网友提供的trace来看，其中lgwr trace都是如下信息：

*** 2012-03-15 22:53:53.488
Warning: log write time 600ms, size 7893KB
*** 2012-03-15 22:54:00.669
Warning: log write time 540ms, size 7362KB
*** 2012-03-15 22:54:18.592
Warning: log write time 640ms, size 4040KB
*** 2012-03-15 22:54:19.182
Warning: log write time 590ms, size 7412KB
*** 2012-03-15 22:54:30.110
Warning: log write time 910ms, size 538KB
*** 2012-03-15 22:54:36.899
Warning: log write time 550ms, size 7595KB
*** 2012-03-15 22:54:48.679
Warning: log write time 720ms, size 10229KB
*** 2012-03-15 22:54:55.736
Warning: log write time 510ms, size 7491KB
*** 2012-03-15 22:55:07.782
Warning: log write time 830ms, size 4178KB

说明lgwr在不停写，且在等待。

从另外一个trace里面也可以看到如下信息：

*** 2012-03-15 22:43:48.114
Waited for detached process: CKPT for 310 seconds:
*** 2012-03-15 22:43:48.114
Dumping diagnostic information for CKPT:
OS pid = 524594

*** 2012-03-15 22:44:00.687
Waited for detached process: CKPT for 320 seconds:
*** 2012-03-15 22:44:00.687
Dumping diagnostic information for CKPT:
OS pid = 524594

我们知道检查点也是在不停的工作的，如果logfile切换频繁很可能会导致这个情况。
那么最后我们再来看看ckpt进程的状态信息：

*** 2012-03-15 22:59:38.881
Unable to get enqueue on resource CF-00000000-00000000 (ges mode req=3 held=6)
Possible local blocker ospid=524594 sid=550 sser=1 time_held=674 secs (ges mode req=6 held=4)
DUMP LOCAL BLOCKER: initiate state dump for KILL BLOCKER
  possible owner[7.524594] on resource CF-00000000-00000000
Dumping process info of pid[7.524594] requested by pid[17.299280]
ORA-00494: enqueue [CF] held for too long (more than 900 seconds) by 'inst 1, osid 524594'
-------------------------------------------------------------------------------
ENQUEUE [CF] HELD FOR TOO LONG

enqueue holder: 'inst 1, osid 524594'

Process 'inst 1, osid 524594' is holding an enqueue for maximum allowed time.
The process will be terminated.

Oracle Support Services triaging information: to find the root-cause, look
at the call stack of process 'inst 1, osid 524594' located below. Ask the
developer that owns the first NON-service layer in the stack to investigate.
Common service layers are enqueues (ksq), latches (ksl), library cache
pins and locks (kgl), and row cache locks (kqr).

Dumping process 7.524594 info:
*** 2012-03-15 22:59:38.968
Dumping diagnostic information for ospid 524594:
OS pid = 524594
loadavg : 1.38 1.97 2.29
swap info: free_mem = 34.10M rsv = 64.00M
           alloc = 7237.58M avail = 16384.00M swap_free = 9146.42M
       F S      UID    PID   PPID   C PRI NI ADDR    SZ    WCHAN    STIME    TTY  TIME CMD
  240001 T   oracle 524594      1   0  60 20 c1d0d590 108076            Mar 12      -  1:35 ora_ckpt_cn3fsa
open: Permission denied
524594: ora_ckpt_cn3fsa
0x000000010025d950  ksfdaio1(??, ??, ??) + 0x330
0x000000010022c988  kcflbi(??, ??, ??, ??, ??, ??, ??, ??) + 0x6a8
0x00000001010f0970  kcbldio(??, ??, ??) + 0x1eb0
0x00000001010f6468  kcblsltio(??, ??, ??, ??) + 0x548
0x00000001010e3b50  krhahra(??) + 0x310
0x00000001010e711c  krhahini(??, ??, ??, ??) + 0x57c
0x00000001018fc4fc  kcvcfsi(??, ??, ??, ??, ??, ??) + 0x31c
0x00000001018e1744  kcvucpb(??, ??, ??, ??, ??, ??, ??, ??) + 0x1a4
0x00000001018fa344  kcvcca(??, ??) + 0x104
0x000000010015e728  ksbcti(??, ??, ??) + 0x3c8
0x00000001001694fc  ksbabs(??) + 0x25c
0x0000000100166554  ksbrdp() + 0x4b4
0x000000010430c09c  opirip(??, ??, ??) + 0x3fc
0x0000000102d9a598  opidrv(??, ??, ??) + 0x458
0x000000010370b0b0  sou2o(??, ??, ??, ??) + 0x90
0x0000000100000870  opimai_real(??, ??) + 0x150
0x00000001000006d8  main(??, ??) + 0x98
0x0000000100000368  __start() + 0x98
*** 2012-03-15 22:59:40.936
----------------------------------------
SO: 70000024c3b20e0, type: 2, owner: 0, flag: INIT/-/-/0x00
  (process) Oracle pid=7, calls cur/top: 70000024f5da8f0/70000024f5da8f0, flag: (6) SYSTEM
            int error: 0, call error: 0, sess error: 0, txn error 0
  (post info) last post received: 0 0 114
              last post received-location: kcbbza
              last process to post me: 70000024c3b1900 1 6
              last post sent: 0 0 9
              last post sent-location: ksqrcl
              last process posted by me: 70000024c3b9700 106 2
  (latch info) wait_event=0 bits=0
    Process Group: DEFAULT, pseudo proc: 70000024c42cbc0
    O/S info: user: oracle, term: UNKNOWN, ospid: 524594
    OSD pid info: Unix process pid: 524594, image: oracle@lgcenjfsa (CKPT)


SO: 70000024f5934a0, type: 4, owner: 70000024c3b20e0, flag: INIT/-/-/0x00
  (session) sid: 550 trans: 0, creator: 70000024c3b20e0, flag: (51) USR/- BSY/-/-/-/-/-
            DID: 0001-0007-00000005, short-term DID: 0000-0000-00000000
            txn branch: 0
            oct: 0, prv: 0, sql: 0, psql: 0, user: 0/SYS
  last wait for 'direct path write' blocking sess=0x0 seq=6421 wait_time=1 seconds since wait started=0
          file number=33, first dba=1, block cnt=1
  Dumping Session Wait History
   for 'direct path write' count=1 wait_time=1
          file number=33, first dba=1, block cnt=1
   for 'direct path write' count=1 wait_time=1
          file number=34, first dba=1, block cnt=1
   for 'direct path write' count=1 wait_time=0
          file number=35, first dba=1, block cnt=1
   for 'direct path write' count=1 wait_time=0
          file number=36, first dba=1, block cnt=1
   for 'direct path write' count=1 wait_time=0
          file number=37, first dba=1, block cnt=1
   for 'direct path write' count=1 wait_time=1
          file number=38, first dba=1, block cnt=1
   for 'direct path write' count=1 wait_time=0
          file number=39, first dba=1, block cnt=1
   for 'direct path write' count=1 wait_time=0
          file number=3a, first dba=1, block cnt=1
   for 'direct path write' count=1 wait_time=1
          file number=3b, first dba=1, block cnt=1
   for 'direct path write' count=1 wait_time=1
          file number=3c, first dba=1, block cnt=1

      service name: SYS$BACKGROUND
    waiting for 'rdbms ipc message' blocking sess=0x0 seq=6432 wait_time=0 seconds since wait started=0
                timeout=11c, =0, =0
    Dumping Session Wait History
     for 'rdbms ipc message' count=1 wait_time=52
                timeout=11d, =0, =0
     for 'rdbms ipc message' count=1 wait_time=150428
                timeout=12c, =0, =0
     for 'log buffer space' count=1 wait_time=4
                =0, =0, =0
     for 'control file sequential read' count=1 wait_time=18
                file#=0, block#=3, blocks=1
     for 'control file parallel write' count=1 wait_time=32552
                files=3, block#=3, requests=3
     for 'control file sequential read' count=1 wait_time=13
                file#=0, block#=1, blocks=1
     for 'control file parallel write' count=1 wait_time=14208
                files=3, block#=1, requests=3
     for 'control file parallel write' count=1 wait_time=16303
                files=3, block#=f, requests=3
     for 'control file parallel write' count=1 wait_time=13450
                files=3, block#=11, requests=3
     for 'control file parallel write' count=1 wait_time=22983
                files=3, block#=1a, requests=3

    ----------------------------------------
      SO: 70000024f682f80, type: 5, owner: 70000024f5da8f0, flag: INIT/-/-/0x00
      (enqueue) CF-00000000-00000000	DID: 0000-0007-00000004
      lv: 00 60 48 5e 00 00 00 00 00 00 00 00 00 00 00 00  res_flag: 0x2
      res: 0x70000024c605120, mode: SS, lock_flag: 0x0
      own: 0x70000024f5934a0, sess: 0x70000024f5934a0, proc: 0x70000024c3b20e0, prv: 0x70000024c605130
    ----------------------------------------

我们接着来看看arc1 进程的信息：

PROCESS 17:
  ----------------------------------------
  SO: 70000024c3b4840, type: 2, owner: 0, flag: INIT/-/-/0x00
  (process) Oracle pid=17, calls cur/top: 70000024f5dc778/70000024f5dc778, flag: (2) SYSTEM
            int error: 0, call error: 0, sess error: 0, txn error 0
  (post info) last post received: 0 0 9
              last post received-location: ksqrcl
              last process to post me: 70000024f3ab6a0 1 2
              last post sent: 0 0 9
              last post sent-location: ksqrcl
              last process posted by me: 70000024f3ab6a0 1 2
    (latch info) wait_event=0 bits=0
    Process Group: DEFAULT, pseudo proc: 70000024c42cbc0
    O/S info: user: oracle, term: UNKNOWN, ospid: 299280
    OSD pid info: Unix process pid: 299280, image: oracle@lgcenjfsa (ARC1)


SO: 70000024c59e2f8, type: 4, owner: 70000024c3b4840, flag: INIT/-/-/0x00
    (session) sid: 541 trans: 0, creator: 70000024c3b4840, flag: (51) USR/- BSY/-/-/-/-/-
              DID: 0001-0011-00000002, short-term DID: 0000-0000-00000000
              txn branch: 0
              oct: 0, prv: 0, sql: 0, psql: 0, user: 0/SYS
    service name: SYS$BACKGROUND
    last wait for 'enq: CF - contention' blocking sess=0x0 seq=28186 wait_time=2929713 seconds since wait started=903
                name|mode=43460004, 0=0, operation=0
    Dumping Session Wait History
     for 'enq: CF - contention' count=1 wait_time=2929713  <== 这里的waitime其实是个累积值。
                name|mode=43460004, 0=0, operation=0
     for 'enq: CF - contention' count=1 wait_time=2929711
                name|mode=43460004, 0=0, operation=0
     for 'enq: CF - contention' count=1 wait_time=2929705
                name|mode=43460004, 0=0, operation=0
     for 'enq: CF - contention' count=1 wait_time=2929707
                name|mode=43460004, 0=0, operation=0
     for 'enq: CF - contention' count=1 wait_time=2929709
                name|mode=43460004, 0=0, operation=0
     for 'enq: CF - contention' count=1 wait_time=2929714
                name|mode=43460004, 0=0, operation=0
     for 'enq: CF - contention' count=1 wait_time=2929711
                name|mode=43460004, 0=0, operation=0
     for 'enq: CF - contention' count=1 wait_time=2929712
                name|mode=43460004, 0=0, operation=0
     for 'enq: CF - contention' count=1 wait_time=2929706
                name|mode=43460004, 0=0, operation=0
     for 'enq: CF - contention' count=1 wait_time=2929711
                name|mode=43460004, 0=0, operation=0


 ----------------------------------------
    SO: 70000024f5dc778, type: 3, owner: 70000024c3b4840, flag: INIT/-/-/0x00
    (call) sess: cur 70000024c59e2f8, rec 70000024c56b588, usr 70000024c59e2f8; depth: 0
      ----------------------------------------
      SO: 70000024f698a10, type: 5, owner: 70000024f5dc778, flag: INIT/-/-/0x00
      (enqueue) CF-00000000-00000000	DID: 0001-0011-00000002
      lv: 00 60 48 5e 00 00 00 00 00 00 00 00 00 00 00 00  res_flag: 0x2
      res: 0x70000024c605120, lock_flag: 0x0
      own: 0x70000024c59e2f8, sess: 0x70000024c59e2f8, prv: 0x70000024f698a40

这里我们可以看到 0x70000024c605120 正是ckpt持有的resouce。

最后我们来看看lgwr进程信息：

PROCESS 6:
  ----------------------------------------
  SO: 70000024f3a8f40, type: 2, owner: 0, flag: INIT/-/-/0x00
  (process) Oracle pid=6, calls cur/top: 70000024f5da630/70000024f5da630, flag: (6) SYSTEM
            int error: 0, call error: 0, sess error: 0, txn error 0
  (post info) last post received: 0 0 24
              last post received-location: ksasnd
              last process to post me: 70000024c3b1900 1 6
              last post sent: 109 0 4
              last post sent-location: kslpsr
              last process posted by me: 70000024f3af5a0 158 0
    (latch info) wait_event=0 bits=0
    Process Group: DEFAULT, pseudo proc: 70000024c42cbc0
    O/S info: user: oracle, term: UNKNOWN, ospid: 413940
    OSD pid info: Unix process pid: 413940, image: oracle@lgcenjfsa (LGWR)


        ----------------------------------------
    SO: 70000024f6831e0, type: 5, owner: 70000024f3a8f40, flag: INIT/-/-/0x00
    (enqueue) RT-00000001-00000000	DID: 0000-0006-00000005
    lv: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  res_flag: 0x2
    res: 0x70000024c605450, mode: X, lock_flag: 0x0
    own: 0x70000024c5a4e00, sess: 0x70000024c5a4e00, proc: 0x70000024f3a8f40, prv: 0x70000024c605460
    (FOB) flags=2 fib=7000002487ef290 incno=0 pending i/o cnt=0
     fname=/data04/oradata/cn3fsa/ctlcn3fsa3.ctl
     fno=2 lblksz=16384 fsiz=1204
    (FOB) flags=2 fib=7000002487eeef0 incno=0 pending i/o cnt=0
     fname=/data03/oradata/cn3fsa/ctlcn3fsa2.ctl
     fno=1 lblksz=16384 fsiz=1204
    (FOB) flags=2 fib=7000002487eeb50 incno=0 pending i/o cnt=0
     fname=/data02/oradata/cn3fsa/ctlcn3fsa1.ctl
     fno=0 lblksz=16384 fsiz=1204
    ----------------------------------------
    SO: 70000024c6c2338, type: 11, owner: 70000024f3a8f40, flag: INIT/-/-/0x00
    (broadcast handle) flag: (2) ACTIVE SUBSCRIBER, owner: 70000024f3a8f40,
                       event: 5, last message event: 13,
                       last message waited event: 13, messages read: 1
                       channel: (70000024c6e7790) scumnt mount lock
                                scope: 1, event: 15, last mesage event: 13,
                                publishers/subscribers: 0/13,
                                messages published: 1
    ----------------------------------------
    SO: 70000024c5a4e00, type: 4, owner: 70000024f3a8f40, flag: INIT/-/-/0x00
    (session) sid: 551 trans: 0, creator: 70000024f3a8f40, flag: (51) USR/- BSY/-/-/-/-/-
              DID: 0001-0006-00000006, short-term DID: 0000-0000-00000000
              txn branch: 0
              oct: 0, prv: 0, sql: 0, psql: 0, user: 0/SYS
    service name: SYS$BACKGROUND
    waiting for 'rdbms ipc message' blocking sess=0x0 seq=36384 wait_time=0 seconds since wait started=0
                timeout=129, =0, =0
    Dumping Session Wait History
     for 'control file sequential read' count=1 wait_time=4
                file#=0, block#=16, blocks=1
     for 'control file sequential read' count=1 wait_time=4
                file#=0, block#=13, blocks=1
     for 'control file sequential read' count=1 wait_time=4
                file#=0, block#=12, blocks=1
     for 'control file sequential read' count=1 wait_time=4
                file#=0, block#=10, blocks=1
     for 'control file sequential read' count=1 wait_time=5
                file#=2, block#=1, blocks=1
     for 'control file sequential read' count=1 wait_time=5
                file#=1, block#=1, blocks=1
     for 'control file sequential read' count=1 wait_time=8
                file#=0, block#=1, blocks=1
     for 'rdbms ipc message' count=1 wait_time=26490
                timeout=12c, =0, =0
     for 'control file sequential read' count=1 wait_time=12
                file#=0, block#=16, blocks=1
     for 'control file sequential read' count=1 wait_time=4
                file#=0, block#=13, blocks=1
    temporary object counter: 0

从trace前面的锁信息得出如下结果：

Unable to get enqueue on resource CF-00000000-00000000 (ges mode req=3 held=6)
Possible local blocker ospid=524594 sid=550 sser=1 time_held=674 secs (ges mode req=6 held=4)

ckpt： hold  4, request 6
arch1: request 3,hold  6

综上所述，得出如下结论：

1. arch1 等待 ckpt (为了获得CF enqueue);
2. lgwr 在等待 arch1 去写日志;
3. arch1 等待时间超过 900s (从上面来看其实仅仅才674s);
4. lgwr 将进程 ckpt kill (因为ckpt进程阻塞arch进程，而lgwr通知arch1进程写归档，那么就间接是ckpt阻塞了lgwr进程);
5. database crash。

从alert来看，是因为log file切换有点过于频繁了，时间短的就30s左右。
根据oracle mos文档Database Crashes With ORA-00494 [ID 753290.1]的建议是这样的：

1. 调整redo log file，报错切换频率维持在 20~30 min;
2. 对于online logfile大小，可以参考 v$instance_recovery.optimal_logfile_size.

评论

Vicki Eichler

2012年3月17日

RightScale’s management tool on any platform is fantastic. The server templates really make deployments easy and manageable.

回复
Database crash with ORA-00494 » Oracle Notes

2012年3月18日

[…] http://www.killdb.com/2012/03/17/database-crash-with-ora-00494.html […]

回复
lizhenxu

2012年3月31日

For test!

回复

发表回复取消回复