database crash with ora-00494

database crash with ora-00494

群论坛一个网友的问题,原帖地址:


http://databasefans.com/read.php?tid=273

###### alert 如下 ######

Thu Mar 15 22:59:38 2012
Errors in file /app/oracle/admin/cn3fsa/bdump/cn3fsa_arc1_299280.trc:
ORA-00494: enqueue [CF] held for too long (more than 900 seconds) by 'inst 1, osid 524594'
Thu Mar 15 22:59:40 2012
System State dumped to trace file /app/oracle/admin/cn3fsa/bdump/cn3fsa_arc1_299280.trc
Thu Mar 15 22:59:41 2012
Thread 1 cannot allocate new log, sequence 167905
Checkpoint not complete
  Current log# 5 seq# 167904 mem# 0: /data02/oradata/cn3fsa/redo5a.rdo
  Current log# 5 seq# 167904 mem# 1: /data03/oradata/cn3fsa/redo5b.rdo
Thu Mar 15 22:59:47 2012
Thread 1 advanced to log sequence 167905 (LGWR switch)
  Current log# 7 seq# 167905 mem# 0: /data04/oradata/cn3fsa/redo7a.rdo
  Current log# 7 seq# 167905 mem# 1: /data01/oradata/cn3fsa/redo7b.rdo
Thu Mar 15 22:59:48 2012
Killing enqueue blocker (pid=524594) on resource CF-00000000-00000000
 by killing session 550.1
Thu Mar 15 22:59:50 2012
Errors in file /app/oracle/admin/cn3fsa/bdump/cn3fsa_ckpt_524594.trc:
ORA-00028: your session has been killed
Thu Mar 15 22:59:50 2012
CKPT: terminating instance due to error 28
Instance terminated by CKPT, pid = 524594
Fri Mar 16 08:50:50 2012
Starting ORACLE instance (normal)
sskgpgetexecname failed to get name
LICENSE_MAX_SESSION = 0
LICENSE_SESSIONS_WARNING = 0
Picked latch-free SCN scheme 3
Autotune of undo retention is turned off.
IMODE=BR
ILAT =61
LICENSE_MAX_USERS = 0
SYS auditing is disabled
ksdpec: called for event 13740 prior to event group initialization
Starting up ORACLE RDBMS Version: 10.2.0.4.0.
System parameters with non-default values:
  processes                = 500
  sessions                 = 555
从上面错误,我们可以判断是arch1无法获得CF锁,而阻塞者 524594 很可能是后台核心进行,以至于
被kill后导致实例crash了。

这里需要说明一下的是该机制是这样的:

当进行需要获取CF enqueue时,最大等待时间为900s,超过900s时,将强制kill阻塞进程。通过如下隐含
参数控制这个timeout.
SQL> show parameter controlfile_enq

NAME                                 TYPE        VALUE
------------------------------------ ----------- ------------------------------
_controlfile_enqueue_dump            boolean     FALSE
_controlfile_enqueue_holding_time    integer     120
_controlfile_enqueue_timeout         integer     900
_kill_controlfile_enqueue_blocker    boolean     TRUE
下面我们来看看进程 524594 是什么进程?

通过trace我们可以发现,如下:
Unix process pid: 524594, image: oracle@lgcenjfsa (CKPT)
我们可以发现,原来是ckpt进程被干掉了,导致数据库crash了,那么我们再进一步分析,
是什么进程kill ckpt进程了呢?从上面alert log来看,error是写到arch1 trace里面的,
所以这里我想应该是跟lgwr进程有关系。

从网友提供的trace来看,其中lgwr trace都是如下信息:
*** 2012-03-15 22:53:53.488
Warning: log write time 600ms, size 7893KB
*** 2012-03-15 22:54:00.669
Warning: log write time 540ms, size 7362KB
*** 2012-03-15 22:54:18.592
Warning: log write time 640ms, size 4040KB
*** 2012-03-15 22:54:19.182
Warning: log write time 590ms, size 7412KB
*** 2012-03-15 22:54:30.110
Warning: log write time 910ms, size 538KB
*** 2012-03-15 22:54:36.899
Warning: log write time 550ms, size 7595KB
*** 2012-03-15 22:54:48.679
Warning: log write time 720ms, size 10229KB
*** 2012-03-15 22:54:55.736
Warning: log write time 510ms, size 7491KB
*** 2012-03-15 22:55:07.782
Warning: log write time 830ms, size 4178KB
说明lgwr在不停写,且在等待。

从另外一个trace里面也可以看到如下信息:
*** 2012-03-15 22:43:48.114
Waited for detached process: CKPT for 310 seconds:
*** 2012-03-15 22:43:48.114
Dumping diagnostic information for CKPT:
OS pid = 524594

*** 2012-03-15 22:44:00.687
Waited for detached process: CKPT for 320 seconds:
*** 2012-03-15 22:44:00.687
Dumping diagnostic information for CKPT:
OS pid = 524594
我们知道检查点也是在不停的工作的,如果logfile切换频繁很可能会导致这个情况。
那么最后我们再来看看ckpt进程的状态信息:
*** 2012-03-15 22:59:38.881
Unable to get enqueue on resource CF-00000000-00000000 (ges mode req=3 held=6)
Possible local blocker ospid=524594 sid=550 sser=1 time_held=674 secs (ges mode req=6 held=4)
DUMP LOCAL BLOCKER: initiate state dump for KILL BLOCKER
  possible owner[7.524594] on resource CF-00000000-00000000
Dumping process info of pid[7.524594] requested by pid[17.299280]
ORA-00494: enqueue [CF] held for too long (more than 900 seconds) by 'inst 1, osid 524594'
-------------------------------------------------------------------------------
ENQUEUE [CF] HELD FOR TOO LONG

enqueue holder: 'inst 1, osid 524594'

Process 'inst 1, osid 524594' is holding an enqueue for maximum allowed time.
The process will be terminated.

Oracle Support Services triaging information: to find the root-cause, look
at the call stack of process 'inst 1, osid 524594' located below. Ask the
developer that owns the first NON-service layer in the stack to investigate.
Common service layers are enqueues (ksq), latches (ksl), library cache
pins and locks (kgl), and row cache locks (kqr).

Dumping process 7.524594 info:
*** 2012-03-15 22:59:38.968
Dumping diagnostic information for ospid 524594:
OS pid = 524594
loadavg : 1.38 1.97 2.29
swap info: free_mem = 34.10M rsv = 64.00M
           alloc = 7237.58M avail = 16384.00M swap_free = 9146.42M
       F S      UID    PID   PPID   C PRI NI ADDR    SZ    WCHAN    STIME    TTY  TIME CMD
  240001 T   oracle 524594      1   0  60 20 c1d0d590 108076            Mar 12      -  1:35 ora_ckpt_cn3fsa
open: Permission denied
524594: ora_ckpt_cn3fsa
0x000000010025d950  ksfdaio1(??, ??, ??) + 0x330
0x000000010022c988  kcflbi(??, ??, ??, ??, ??, ??, ??, ??) + 0x6a8
0x00000001010f0970  kcbldio(??, ??, ??) + 0x1eb0
0x00000001010f6468  kcblsltio(??, ??, ??, ??) + 0x548
0x00000001010e3b50  krhahra(??) + 0x310
0x00000001010e711c  krhahini(??, ??, ??, ??) + 0x57c
0x00000001018fc4fc  kcvcfsi(??, ??, ??, ??, ??, ??) + 0x31c
0x00000001018e1744  kcvucpb(??, ??, ??, ??, ??, ??, ??, ??) + 0x1a4
0x00000001018fa344  kcvcca(??, ??) + 0x104
0x000000010015e728  ksbcti(??, ??, ??) + 0x3c8
0x00000001001694fc  ksbabs(??) + 0x25c
0x0000000100166554  ksbrdp() + 0x4b4
0x000000010430c09c  opirip(??, ??, ??) + 0x3fc
0x0000000102d9a598  opidrv(??, ??, ??) + 0x458
0x000000010370b0b0  sou2o(??, ??, ??, ??) + 0x90
0x0000000100000870  opimai_real(??, ??) + 0x150
0x00000001000006d8  main(??, ??) + 0x98
0x0000000100000368  __start() + 0x98
*** 2012-03-15 22:59:40.936
----------------------------------------
SO: 70000024c3b20e0, type: 2, owner: 0, flag: INIT/-/-/0x00
  (process) Oracle pid=7, calls cur/top: 70000024f5da8f0/70000024f5da8f0, flag: (6) SYSTEM
            int error: 0, call error: 0, sess error: 0, txn error 0
  (post info) last post received: 0 0 114
              last post received-location: kcbbza
              last process to post me: 70000024c3b1900 1 6
              last post sent: 0 0 9
              last post sent-location: ksqrcl
              last process posted by me: 70000024c3b9700 106 2
  (latch info) wait_event=0 bits=0
    Process Group: DEFAULT, pseudo proc: 70000024c42cbc0
    O/S info: user: oracle, term: UNKNOWN, ospid: 524594
    OSD pid info: Unix process pid: 524594, image: oracle@lgcenjfsa (CKPT)


SO: 70000024f5934a0, type: 4, owner: 70000024c3b20e0, flag: INIT/-/-/0x00
  (session) sid: 550 trans: 0, creator: 70000024c3b20e0, flag: (51) USR/- BSY/-/-/-/-/-
            DID: 0001-0007-00000005, short-term DID: 0000-0000-00000000
            txn branch: 0
            oct: 0, prv: 0, sql: 0, psql: 0, user: 0/SYS
  last wait for 'direct path write' blocking sess=0x0 seq=6421 wait_time=1 seconds since wait started=0
          file number=33, first dba=1, block cnt=1
  Dumping Session Wait History
   for 'direct path write' count=1 wait_time=1
          file number=33, first dba=1, block cnt=1
   for 'direct path write' count=1 wait_time=1
          file number=34, first dba=1, block cnt=1
   for 'direct path write' count=1 wait_time=0
          file number=35, first dba=1, block cnt=1
   for 'direct path write' count=1 wait_time=0
          file number=36, first dba=1, block cnt=1
   for 'direct path write' count=1 wait_time=0
          file number=37, first dba=1, block cnt=1
   for 'direct path write' count=1 wait_time=1
          file number=38, first dba=1, block cnt=1
   for 'direct path write' count=1 wait_time=0
          file number=39, first dba=1, block cnt=1
   for 'direct path write' count=1 wait_time=0
          file number=3a, first dba=1, block cnt=1
   for 'direct path write' count=1 wait_time=1
          file number=3b, first dba=1, block cnt=1
   for 'direct path write' count=1 wait_time=1
          file number=3c, first dba=1, block cnt=1

      service name: SYS$BACKGROUND
    waiting for 'rdbms ipc message' blocking sess=0x0 seq=6432 wait_time=0 seconds since wait started=0
                timeout=11c, =0, =0
    Dumping Session Wait History
     for 'rdbms ipc message' count=1 wait_time=52
                timeout=11d, =0, =0
     for 'rdbms ipc message' count=1 wait_time=150428
                timeout=12c, =0, =0
     for 'log buffer space' count=1 wait_time=4
                =0, =0, =0
     for 'control file sequential read' count=1 wait_time=18
                file#=0, block#=3, blocks=1
     for 'control file parallel write' count=1 wait_time=32552
                files=3, block#=3, requests=3
     for 'control file sequential read' count=1 wait_time=13
                file#=0, block#=1, blocks=1
     for 'control file parallel write' count=1 wait_time=14208
                files=3, block#=1, requests=3
     for 'control file parallel write' count=1 wait_time=16303
                files=3, block#=f, requests=3
     for 'control file parallel write' count=1 wait_time=13450
                files=3, block#=11, requests=3
     for 'control file parallel write' count=1 wait_time=22983
                files=3, block#=1a, requests=3

    ----------------------------------------
      SO: 70000024f682f80, type: 5, owner: 70000024f5da8f0, flag: INIT/-/-/0x00
      (enqueue) CF-00000000-00000000	DID: 0000-0007-00000004
      lv: 00 60 48 5e 00 00 00 00 00 00 00 00 00 00 00 00  res_flag: 0x2
      res: 0x70000024c605120, mode: SS, lock_flag: 0x0
      own: 0x70000024f5934a0, sess: 0x70000024f5934a0, proc: 0x70000024c3b20e0, prv: 0x70000024c605130
    ----------------------------------------
我们接着来看看arc1 进程的信息:
PROCESS 17:
  ----------------------------------------
  SO: 70000024c3b4840, type: 2, owner: 0, flag: INIT/-/-/0x00
  (process) Oracle pid=17, calls cur/top: 70000024f5dc778/70000024f5dc778, flag: (2) SYSTEM
            int error: 0, call error: 0, sess error: 0, txn error 0
  (post info) last post received: 0 0 9
              last post received-location: ksqrcl
              last process to post me: 70000024f3ab6a0 1 2
              last post sent: 0 0 9
              last post sent-location: ksqrcl
              last process posted by me: 70000024f3ab6a0 1 2
    (latch info) wait_event=0 bits=0
    Process Group: DEFAULT, pseudo proc: 70000024c42cbc0
    O/S info: user: oracle, term: UNKNOWN, ospid: 299280
    OSD pid info: Unix process pid: 299280, image: oracle@lgcenjfsa (ARC1)


SO: 70000024c59e2f8, type: 4, owner: 70000024c3b4840, flag: INIT/-/-/0x00
    (session) sid: 541 trans: 0, creator: 70000024c3b4840, flag: (51) USR/- BSY/-/-/-/-/-
              DID: 0001-0011-00000002, short-term DID: 0000-0000-00000000
              txn branch: 0
              oct: 0, prv: 0, sql: 0, psql: 0, user: 0/SYS
    service name: SYS$BACKGROUND
    last wait for 'enq: CF - contention' blocking sess=0x0 seq=28186 wait_time=2929713 seconds since wait started=903
                name|mode=43460004, 0=0, operation=0
    Dumping Session Wait History
     for 'enq: CF - contention' count=1 wait_time=2929713  <== 这里的waitime其实是个累积值。
                name|mode=43460004, 0=0, operation=0
     for 'enq: CF - contention' count=1 wait_time=2929711
                name|mode=43460004, 0=0, operation=0
     for 'enq: CF - contention' count=1 wait_time=2929705
                name|mode=43460004, 0=0, operation=0
     for 'enq: CF - contention' count=1 wait_time=2929707
                name|mode=43460004, 0=0, operation=0
     for 'enq: CF - contention' count=1 wait_time=2929709
                name|mode=43460004, 0=0, operation=0
     for 'enq: CF - contention' count=1 wait_time=2929714
                name|mode=43460004, 0=0, operation=0
     for 'enq: CF - contention' count=1 wait_time=2929711
                name|mode=43460004, 0=0, operation=0
     for 'enq: CF - contention' count=1 wait_time=2929712
                name|mode=43460004, 0=0, operation=0
     for 'enq: CF - contention' count=1 wait_time=2929706
                name|mode=43460004, 0=0, operation=0
     for 'enq: CF - contention' count=1 wait_time=2929711
                name|mode=43460004, 0=0, operation=0


 ----------------------------------------
    SO: 70000024f5dc778, type: 3, owner: 70000024c3b4840, flag: INIT/-/-/0x00
    (call) sess: cur 70000024c59e2f8, rec 70000024c56b588, usr 70000024c59e2f8; depth: 0
      ----------------------------------------
      SO: 70000024f698a10, type: 5, owner: 70000024f5dc778, flag: INIT/-/-/0x00
      (enqueue) CF-00000000-00000000	DID: 0001-0011-00000002
      lv: 00 60 48 5e 00 00 00 00 00 00 00 00 00 00 00 00  res_flag: 0x2
      res: 0x70000024c605120, lock_flag: 0x0
      own: 0x70000024c59e2f8, sess: 0x70000024c59e2f8, prv: 0x70000024f698a40
这里我们可以看到 0x70000024c605120 正是ckpt持有的resouce。

最后我们来看看lgwr进程信息:
PROCESS 6:
  ----------------------------------------
  SO: 70000024f3a8f40, type: 2, owner: 0, flag: INIT/-/-/0x00
  (process) Oracle pid=6, calls cur/top: 70000024f5da630/70000024f5da630, flag: (6) SYSTEM
            int error: 0, call error: 0, sess error: 0, txn error 0
  (post info) last post received: 0 0 24
              last post received-location: ksasnd
              last process to post me: 70000024c3b1900 1 6
              last post sent: 109 0 4
              last post sent-location: kslpsr
              last process posted by me: 70000024f3af5a0 158 0
    (latch info) wait_event=0 bits=0
    Process Group: DEFAULT, pseudo proc: 70000024c42cbc0
    O/S info: user: oracle, term: UNKNOWN, ospid: 413940
    OSD pid info: Unix process pid: 413940, image: oracle@lgcenjfsa (LGWR)


        ----------------------------------------
    SO: 70000024f6831e0, type: 5, owner: 70000024f3a8f40, flag: INIT/-/-/0x00
    (enqueue) RT-00000001-00000000	DID: 0000-0006-00000005
    lv: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  res_flag: 0x2
    res: 0x70000024c605450, mode: X, lock_flag: 0x0
    own: 0x70000024c5a4e00, sess: 0x70000024c5a4e00, proc: 0x70000024f3a8f40, prv: 0x70000024c605460
    (FOB) flags=2 fib=7000002487ef290 incno=0 pending i/o cnt=0
     fname=/data04/oradata/cn3fsa/ctlcn3fsa3.ctl
     fno=2 lblksz=16384 fsiz=1204
    (FOB) flags=2 fib=7000002487eeef0 incno=0 pending i/o cnt=0
     fname=/data03/oradata/cn3fsa/ctlcn3fsa2.ctl
     fno=1 lblksz=16384 fsiz=1204
    (FOB) flags=2 fib=7000002487eeb50 incno=0 pending i/o cnt=0
     fname=/data02/oradata/cn3fsa/ctlcn3fsa1.ctl
     fno=0 lblksz=16384 fsiz=1204
    ----------------------------------------
    SO: 70000024c6c2338, type: 11, owner: 70000024f3a8f40, flag: INIT/-/-/0x00
    (broadcast handle) flag: (2) ACTIVE SUBSCRIBER, owner: 70000024f3a8f40,
                       event: 5, last message event: 13,
                       last message waited event: 13, messages read: 1
                       channel: (70000024c6e7790) scumnt mount lock
                                scope: 1, event: 15, last mesage event: 13,
                                publishers/subscribers: 0/13,
                                messages published: 1
    ----------------------------------------
    SO: 70000024c5a4e00, type: 4, owner: 70000024f3a8f40, flag: INIT/-/-/0x00
    (session) sid: 551 trans: 0, creator: 70000024f3a8f40, flag: (51) USR/- BSY/-/-/-/-/-
              DID: 0001-0006-00000006, short-term DID: 0000-0000-00000000
              txn branch: 0
              oct: 0, prv: 0, sql: 0, psql: 0, user: 0/SYS
    service name: SYS$BACKGROUND
    waiting for 'rdbms ipc message' blocking sess=0x0 seq=36384 wait_time=0 seconds since wait started=0
                timeout=129, =0, =0
    Dumping Session Wait History
     for 'control file sequential read' count=1 wait_time=4
                file#=0, block#=16, blocks=1
     for 'control file sequential read' count=1 wait_time=4
                file#=0, block#=13, blocks=1
     for 'control file sequential read' count=1 wait_time=4
                file#=0, block#=12, blocks=1
     for 'control file sequential read' count=1 wait_time=4
                file#=0, block#=10, blocks=1
     for 'control file sequential read' count=1 wait_time=5
                file#=2, block#=1, blocks=1
     for 'control file sequential read' count=1 wait_time=5
                file#=1, block#=1, blocks=1
     for 'control file sequential read' count=1 wait_time=8
                file#=0, block#=1, blocks=1
     for 'rdbms ipc message' count=1 wait_time=26490
                timeout=12c, =0, =0
     for 'control file sequential read' count=1 wait_time=12
                file#=0, block#=16, blocks=1
     for 'control file sequential read' count=1 wait_time=4
                file#=0, block#=13, blocks=1
    temporary object counter: 0
从trace前面的锁信息得出如下结果:
Unable to get enqueue on resource CF-00000000-00000000 (ges mode req=3 held=6)
Possible local blocker ospid=524594 sid=550 sser=1 time_held=674 secs (ges mode req=6 held=4)

ckpt: hold  4, request 6
arch1: request 3,hold  6
综上所述,得出如下结论:

1. arch1 等待 ckpt (为了获得CF enqueue);
2. lgwr 在等待 arch1 去写日志;
3. arch1 等待时间超过 900s (从上面来看其实仅仅才674s);
4. lgwr 将进程 ckpt kill (因为ckpt进程阻塞arch进程,而lgwr通知arch1进程写归档,那么就间接是ckpt阻塞了lgwr进程);
5. database crash。

从alert来看,是因为log file切换有点过于频繁了,时间短的就30s左右。
根据oracle mos文档Database Crashes With ORA-00494 [ID 753290.1]的建议是这样的:

1. 调整redo log file,报错切换频率维持在 20~30 min;
2. 对于online logfile大小,可以参考 v$instance_recovery.optimal_logfile_size.

评论

  1. Vicki Eichler 的头像
    Vicki Eichler

    RightScale’s management tool on any platform is fantastic. The server templates really make deployments easy and manageable.

  2. lizhenxu 的头像
    lizhenxu

    For test!

发表回复

您的电子邮箱地址不会被公开。 必填项已用 * 标注